Author: https://twitter.com/ldl19691031

For the global pipeline of Unreal Engine 5, please check this: https://www.notion.so/Brief-Analysis-of-UE5-Rendering-Pipeline-feedcb9174aa4af2af936fbb02a9e390

In this article, I will write up my own analysis of Nanite's source code. I only want to walk through the basic process, not analyze the ideas behind Nanite; that is well beyond my skill. For the ideas behind Nanite, I recommend waiting for the official SIGGRAPH presentation.

This is the official page of Nanite: https://docs.unrealengine.com/5.0/en-US/RenderingFeatures/Nanite/

Here is another good analysis of Nanite: http://www.elopezr.com/a-macro-view-of-nanite/

And the official Inside Unreal talk: https://www.youtube.com/watch?v=TMorJX3Nj6U

WARNING: I'm not a developer at Epic, and this article may contain mistakes and mislead you. This article focuses on the details of Nanite's source code, so I recommend that you at least watch the Inside Unreal Nanite talk first.

## Basic concepts

Nanite consists of two parts: a pre-calculation system that preprocesses the meshes and generates mesh clusters and LODs, and a runtime system for loading and rendering.

So, clusters are the basic elements of a Nanite mesh. During rendering, Nanite needs to:

Culling: both instance culling and primitive culling

Change the LOD: Nanite changes LOD level cluster-wise, which is different from many instance-wise LOD approaches.
You can see only some of the clusters have changed:

Rendering: by the hardware and software rasterizers (I discussed this a little in the rendering pipeline analysis)

To explain how Nanite builds its mip (LOD) levels, I created this image:

First, Nanite generates a set of clusters from the triangles. These become the leaf clusters.

Then, it selects a set of clusters as a new cluster group and merges this group into one large cluster. (2 and 3)

Then, it simplifies this large cluster, reducing the polygon count but keeping the boundary. (4)

Then, do a graph partition to split this large cluster into small parts. The parts are larger than the original clusters in (2)

You can see that cluster set B has the same boundary as cluster set A. This means we can switch from A to B to reduce the polygon count without creating any seam on the boundary.

So, when can we apply this switch safely? Nanite creates a dependency graph (a DAG) to represent this. (6)

Finally, we break up cluster set B and get a set of individual clusters again. These steps are repeated until the total number of clusters is less than a given threshold.

So, in theory, switching LODs will never leave empty space on the mesh's surface, because every switch is boundary safe.
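The loop described above can be sketched as a toy model that only tracks how many clusters exist at each level. This is my own illustration, not code from the engine: `ClusterCountsPerLevel`, the group size, and the rule that simplification halves the cluster count of a group are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of the build loop: it only counts clusters per level.
// Assumptions (hypothetical): a group merges up to `groupSize` clusters, and
// after simplification a group of N clusters is split back into ceil(N / 2).
std::vector<uint32_t> ClusterCountsPerLevel(uint32_t leafClusters,
                                            uint32_t groupSize,
                                            uint32_t stopThreshold)
{
    assert(leafClusters > 0 && groupSize >= 2 && stopThreshold >= 1);
    std::vector<uint32_t> counts{leafClusters};
    uint32_t current = leafClusters;
    while (current > stopThreshold)
    {
        uint32_t next = 0;
        for (uint32_t first = 0; first < current; first += groupSize)
        {
            // Merge: take the next run of clusters as one group.
            const uint32_t inGroup = std::min(groupSize, current - first);
            // Simplify + split: the group comes back as fewer clusters.
            next += (inGroup + 1) / 2;
        }
        counts.push_back(next);
        current = next;
    }
    return counts;
}
```

With 64 leaf clusters, groups of 8, and a threshold of 1, this model produces the levels 64, 32, 16, 8, 4, 2, 1. The real builder picks groups by graph partitioning and works on triangles, so actual counts will differ.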

## Pre-processing

The pre-processing happens in the Unreal Editor. When you enable Nanite for a static mesh, Nanite builds cluster data for the mesh, encodes it, and writes it to disk.

This processing happens in the BuildNaniteData function in NaniteBuilder.cpp.

The pre-processing outputs are an array of clusters and an array of groups.

### Cluster Data Structure

So, let's first look at the data that each cluster holds:

```cpp
TArray< float > Verts;
TArray< uint32 > Indexes;
FBounds Bounds;
FSphere SphereBounds;
FSphere LODBounds;
uint32 GroupIndex = MAX_uint32;
uint32 GroupPartIndex = MAX_uint32;
uint32 GeneratingGroupIndex = MAX_uint32;
```

As a real example, this is the data of one cluster:

You can see that each cluster contains its own triangles and bounds. It also contains its boundary edges and external edges.

Beyond this, we have cluster groups, each of which contains a set of clusters:

```cpp
FSphere Bounds;
FSphere LODBounds;
float MinLODError;
float MaxParentLODError;
int32 MipLevel;
uint32 MeshIndex;
TArray< uint32 > Children;
```

#### Dependency

So, with a cluster group as the intermediate data, the references between two levels of clusters look like this:

Be careful: I didn't say Cluster Group B has a direct link to Cluster Group A. The generating group index is actually used as a data dependency. The groups are not organized as a tree by this information; it is much more complex than that. DO NOT be misled by this picture!

#### LOD Error

Each group calculates its LOD error. This part is really important and answers the question of why we do not need a direct reference between Cluster Group A and B.

```cpp
// Cluster
float LODError = 0.0f;
// Cluster Group
float MinLODError;
float MaxParentLODError;
```

For a cluster, the LOD error is just a number, but for a cluster group, the LOD error is a range [MinLODError, MaxParentLODError].

For the LOD error calculation:

As we explained before, from cluster group A to cluster set B, we will do: merge, simplify, split.

So, let's assume our cluster group A has 3 clusters with LOD errors 0.1, 0.3, and 0.5. They are merged, simplified, and split into two new clusters on the right. Because of the simplification, the error of the new clusters may be higher than before.

So, for the left part, the cluster group finds the minimal error among its children; in our case, it is 0.1. Then, from the newly generated simplified clusters, it finds the max error and forces all of those new clusters to have this max error as their LOD error. This is important!

Now, for the left cluster group, the range will be [0.1, 0.7]. And if any other group (for example, cluster group B) contains the two newly generated clusters, its MaxParentLODError will be larger than cluster group A's MaxParentLODError. Please remember this; I will use it in the next part.
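The range calculation above can be written as a small helper. This is a minimal sketch under my own assumptions, not the code in NaniteBuilder.cpp: `GroupErrorRange` and `ComputeGroupErrorRange` are hypothetical names, the 0.6 used in the usage note is an invented input, and clamping the parent error so it never drops below the children's errors is an assumption I added to keep the errors monotonic.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct GroupErrorRange
{
    float MinLODError;        // smallest error among the group's own clusters
    float MaxParentLODError;  // error shared by all clusters generated from the group
};

// childErrors:  LOD errors of the clusters that were merged (0.1, 0.3, 0.5 above).
// parentErrors: errors measured on the simplified clusters. They are overwritten
//               so that every generated cluster carries the same maximum value.
GroupErrorRange ComputeGroupErrorRange(const std::vector<float>& childErrors,
                                       std::vector<float>& parentErrors)
{
    assert(!childErrors.empty() && !parentErrors.empty());
    const float minChild = *std::min_element(childErrors.begin(), childErrors.end());
    const float maxChild = *std::max_element(childErrors.begin(), childErrors.end());
    float maxParent = *std::max_element(parentErrors.begin(), parentErrors.end());
    // Assumption: a simplified cluster never reports less error than its sources.
    maxParent = std::max(maxParent, maxChild);
    // Force all generated clusters to share the same LOD error.
    std::fill(parentErrors.begin(), parentErrors.end(), maxParent);
    return {minChild, maxParent};
}
```

Calling it with child errors {0.1, 0.3, 0.5} and measured parent errors {0.6, 0.7} gives the range [0.1, 0.7], and both generated clusters end up with error 0.7.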

## Encoding

The pre-processed Nanite data (clusters and cluster groups) is encoded and sent to GPU memory, where compute shaders access it.

The encoding code is the function Nanite::Encode in NaniteEncode.cpp.

We need to deal with two different kinds of data: the clusters, which contain triangles and more and are therefore complex and large, and the groups, which are small and mostly hold references.

#### Cluster Encoding

For the clusters, Nanite encodes them into pages. The size of a page is fixed, so a group may be split into different pages. To achieve this, Nanite uses FClusterGroupPart to record.

Page Assign

When a new page is allocated, or when we start recording a new group, a new cluster group part is created. The clusters are then assigned to the cluster group parts.

(a green box is a cluster)
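The page assignment rule above (start a new part whenever a new page or a new group begins) can be sketched like this. `GroupPart` is a simplified, hypothetical stand-in for FClusterGroupPart, and I assume every cluster costs exactly one slot in a page; the real encoder budgets by data size, which is not modeled here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for FClusterGroupPart: a run of clusters that belong
// to one group and live in one page.
struct GroupPart
{
    uint32_t GroupIndex;
    uint32_t PageIndex;
    uint32_t NumClusters;
};

// Assumption: a page holds `clustersPerPage` clusters, one slot per cluster.
std::vector<GroupPart> AssignPages(const std::vector<uint32_t>& clustersPerGroup,
                                   uint32_t clustersPerPage)
{
    assert(clustersPerPage > 0);
    std::vector<GroupPart> parts;
    uint32_t page = 0;
    uint32_t usedInPage = 0;
    for (uint32_t group = 0; group < clustersPerGroup.size(); ++group)
    {
        bool newGroup = true;
        for (uint32_t c = 0; c < clustersPerGroup[group]; ++c)
        {
            bool newPage = false;
            if (usedInPage == clustersPerPage)
            {
                // The current page is full: allocate the next one.
                ++page;
                usedInPage = 0;
                newPage = true;
            }
            if (newGroup || newPage)
                parts.push_back({group, page, 0});
            newGroup = false;
            ++parts.back().NumClusters;
            ++usedInPage;
        }
    }
    return parts;
}
```

For two groups of 3 and 4 clusters and pages of 5 slots, the second group is split into two parts: 2 clusters at the end of page 0 and 2 clusters at the start of page 1.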

Hierarchies build

First, we need to know the max mip level, which is simply the highest mip level number among all groups.

But within the same mip level, we may still have many nodes:

So, instead of using only these level-based nodes, Nanite builds a BVH to organize them into a tree in which each node has far fewer children. In my test, the maximum number of children per node is 8.

After this step, in my test, 138 nodes were added as BVH nodes to group those nodes.

If you enable the BVH_BUILD_WRITE_GRAPHVIZ macro, you can visualize the tree like this.

Since the detail info is not so important, I just show a small screenshot.

Be careful, this tree is NOT the same tree of cluster groups and clusters. This is a tree based on ClusterGroupPart.

#### Pack Hierarchies

Nanite takes all the hierarchy nodes (in my test, there are 139, which is 138 + 1) and packs them into FPackedHierarchyNode for the GPU to access.

```cpp
struct FPackedHierarchyNode
{
    FSphere LODBounds[MAX_BVH_NODE_FANOUT];
    struct
    {
        FVector BoxBoundsCenter;
        uint32 MinLODError_MaxParentLODError;
    } Misc0[MAX_BVH_NODE_FANOUT];
    struct
    {
        FVector BoxBoundsExtent;
        uint32 ChildStartReference;
    } Misc1[MAX_BVH_NODE_FANOUT];
    struct
    {
        uint32 ResourcePageIndex_NumPages_GroupPartSize;
    } Misc2[MAX_BVH_NODE_FANOUT];
};
```

A short note on 'for the GPU to access': the FPackedHierarchyNode data is uploaded to the HierarchyBuffer in GPU memory and is then decoded by this shader code:

```hlsl
FHierarchyNodeSlice GetHierarchyNodeSlice( uint NodeIndex, uint ChildIndex )
{
    const uint NodeSize = ( 4 + 4 + 4 + 1 ) * 4 * MAX_BVH_NODE_FANOUT;
    uint BaseAddress = NodeIndex * NodeSize;
    FHierarchyNodeSlice Node;
    Node.LODBounds = asfloat( HierarchyBuffer.Load4( BaseAddress + 16 * ChildIndex ) );
    uint4 Misc0 = HierarchyBuffer.Load4( BaseAddress + (MAX_BVH_NODE_FANOUT * 16) + 16 * ChildIndex );
    uint4 Misc1 = HierarchyBuffer.Load4( BaseAddress + (MAX_BVH_NODE_FANOUT * 32) + 16 * ChildIndex );
    uint Misc2 = HierarchyBuffer.Load( BaseAddress + (MAX_BVH_NODE_FANOUT * 48) + 4 * ChildIndex );
    Node.BoxBoundsCenter = asfloat( Misc0.xyz );
    Node.BoxBoundsExtent = asfloat( Misc1.xyz );
    Node.MinLODError = f16tof32( Misc0.w );
    Node.MaxParentLODError = f16tof32( Misc0.w >> 16 );
    Node.ChildStartReference = Misc1.w;
    Node.bLoaded = Misc1.w != 0xFFFFFFFFu;
    uint ResourcePageIndex_NumPages_GroupPartSize = Misc2;
    Node.NumChildren = BitFieldExtractU32(ResourcePageIndex_NumPages_GroupPartSize, MAX_CLUSTERS_PER_GROUP_BITS, 0);
    Node.NumPages = BitFieldExtractU32(ResourcePageIndex_NumPages_GroupPartSize, MAX_GROUP_PARTS_BITS, MAX_CLUSTERS_PER_GROUP_BITS);
    Node.StartPageIndex = BitFieldExtractU32(ResourcePageIndex_NumPages_GroupPartSize, MAX_RESOURCE_PAGES_BITS, MAX_CLUSTERS_PER_GROUP_BITS + MAX_GROUP_PARTS_BITS);
    Node.bEnabled = ResourcePageIndex_NumPages_GroupPartSize != 0u;
    Node.bLeaf = ResourcePageIndex_NumPages_GroupPartSize != 0xFFFFFFFFu;
    return Node;
}
```
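To check my reading of the byte offsets and the bit fields in this decoder, I find it useful to redo the arithmetic on the CPU. The sketch below is a C++ illustration only: the fanout of 8 and the three bit widths are values I assume for the example (the real constants are defined in the engine), and `ExtractBits`, `PackPageInfo`, and `Misc2Offset` are hypothetical helpers that mirror the argument order used above (value, width, offset).

```cpp
#include <cassert>
#include <cstdint>

// Assumed values for illustration only.
constexpr uint32_t kFanout      = 8;   // stands in for MAX_BVH_NODE_FANOUT
constexpr uint32_t kClusterBits = 9;   // stands in for MAX_CLUSTERS_PER_GROUP_BITS
constexpr uint32_t kPartBits    = 3;   // stands in for MAX_GROUP_PARTS_BITS
constexpr uint32_t kPageBits    = 20;  // stands in for MAX_RESOURCE_PAGES_BITS

// Per child: LODBounds (16 bytes) + Misc0 (16) + Misc1 (16) + Misc2 (4).
constexpr uint32_t kNodeSize = (4 + 4 + 4 + 1) * 4 * kFanout;

// Byte offset of Misc2 for one child. It follows the three 16-byte arrays.
constexpr uint32_t Misc2Offset(uint32_t nodeIndex, uint32_t childIndex)
{
    return nodeIndex * kNodeSize + kFanout * 48 + 4 * childIndex;
}

// Reads `numBits` bits starting at bit `offset`.
constexpr uint32_t ExtractBits(uint32_t value, uint32_t numBits, uint32_t offset)
{
    return (value >> offset) & (numBits >= 32u ? 0xFFFFFFFFu : ((1u << numBits) - 1u));
}

// Packs the three fields in the order the decoder expects:
// children count in the low bits, then page count, then start page.
inline uint32_t PackPageInfo(uint32_t startPage, uint32_t numPages, uint32_t numChildren)
{
    assert(numChildren < (1u << kClusterBits));
    assert(numPages < (1u << kPartBits));
    assert(startPage < (1u << kPageBits));
    return (startPage << (kClusterBits + kPartBits)) | (numPages << kClusterBits) | numChildren;
}
```

With a fanout of 8, one node takes 416 bytes, and a value packed by `PackPageInfo(5, 2, 17)` decodes back to 17 children, 2 pages, and start page 5 using the same three extractions as the shader.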

After decoding, the shader gets this data:

```hlsl
struct FHierarchyNodeSlice
{
    float4 LODBounds;
    float3 BoxBoundsCenter;
    float3 BoxBoundsExtent;
    float MinLODError;
    float MaxParentLODError;
    uint ChildStartReference; // Can be node (index) or cluster (page:cluster)
    uint NumChildren;
    uint StartPageIndex;
    uint NumPages;
    bool bEnabled;
    bool bLoaded;
    bool bLeaf;
};
```

I will talk about decoding and usage in another part, so for now I only want to show how this data structure will be used.

#### A Tree of Mips

So, to summarize, Nanite builds a tree like this:

One question is, WHY?

It looks like we have many duplicated versions of the same mesh inside one tree, because different mips cover the same surface even though they are at different LOD levels.

I will explain more in the next part, but since this question took me a long time to figure out, let me give you a short answer:

Yes, different mips are stored inside the same tree, but during culling, for a specific area, only the single best part is selected by considering the LOD error.

## Decoding and Usage

As a virtual geometry system, similar to virtual textures, Nanite needs a feedback loop like this: the CPU uploads the requested data to the GPU, and the GPU tells the CPU which data it needs for the next frame.
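A toy model of such a feedback loop could look like this. Everything here is hypothetical (`ToyStreamer` and its members are not engine types); it only shows the one-frame delay between the GPU requesting a page and the page becoming usable.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model of a streaming feedback loop. All names are hypothetical.
struct ToyStreamer
{
    std::set<uint32_t> ResidentPages;   // pages currently in GPU memory
    std::set<uint32_t> PendingRequests; // pages the GPU asked for during the last frame

    // "GPU" side: use what is resident and record a request for the rest.
    // Returns the number of needed pages that were usable this frame.
    uint32_t RenderFrame(const std::vector<uint32_t>& neededPages)
    {
        uint32_t usable = 0;
        for (uint32_t page : neededPages)
        {
            if (ResidentPages.count(page) != 0)
                ++usable;
            else
                PendingRequests.insert(page);
        }
        assert(usable <= neededPages.size());
        return usable;
    }

    // "CPU" side: upload everything that was requested, before the next frame.
    void ProcessRequests()
    {
        ResidentPages.insert(PendingRequests.begin(), PendingRequests.end());
        PendingRequests.clear();
    }
};
```

If only page 0 is resident and a frame needs pages 0, 1, and 2, the first frame can use one page and requests the other two; after `ProcessRequests()`, the next frame can use all three.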

So we start at the culling part to see how Nanite works with this system.

### Culling

Nanite has two culling steps: instance culling and persistent culling. In this article, we skip the instance culling part and focus on the persistent culling to see how Nanite does the persistent cluster culling. (And this is a really interesting part. If you open ClusterCulling.usf you will see a long comment before the PersistentClusterCull function).

The culling part needs to achieve two goals: remove invisible nodes, and find the best clusters for the current error threshold.

For the first target, it is done by Frustum Culling and HZB Culling based on the bounds of each node. I will skip this part.

For the second goal, do you remember the two things we mentioned before?

Each cluster group (which is the culling unit) has a MaxParentLODError, and a cluster group B that contains the simplified clusters of cluster group A will have a higher MaxParentLODError.

The hierarchy nodes contain all the mips of the cluster groups.

Let's find the best slice of cluster groups, as Epic explained in the official Inside Unreal talk:

(This is a conceptual image, not the actual node tree)

There are two rules:

For a node, if the max LOD error is larger than the target value, we visit its children. Otherwise, we ignore this node's children, since they would be too detailed. The code is the ShouldVisitChild function in ClusterCulling.usf.

For a cluster, if the LOD error is lower than the target value, we will choose this cluster. The code is in the SmallEnoughToDraw function in ClusterCulling.usf.

So, I will now answer the question of why these two parts will never be visible at the same time:

(Please note that the following explanation is simplified to make the selection process easier to follow; the actual calculation is much more complicated.)

Now let's look at this fake tree. The MaxLODError of each node is shown in its rectangle.

Let's assume we are looking for a slice with error 0.5. We start with the top mip nodes. Mips 2 and 3 are too detailed, so their max LOD errors are smaller than 0.5, and they will not be visited. (sorry ...)

Mips 3 and 4 are visited based on the 1st rule.

In the end, both the red cluster group and the right blue cluster group are visited, and then we apply the 2nd rule. Since only the clusters in the red cluster group have errors lower than 0.5, the clusters on the left become visible, and the two clusters in the blue group stay invisible.

We can imagine this selection as "the 3 clusters in the red cluster group have replaced the two clusters in the blue group".

Let's do this again, but now we choose the error value as 0.8.

Now the node with 0.71 Max error is not larger than the target error 0.8. We will not visit this node. But the blue group will be visited. And since 0.7 is smaller than 0.8, the two clusters inside the blue cluster group will be visible.

To summarize: with this design, Nanite chooses clusters by considering both detail (by descending to nodes with a lower max LOD error) and performance (by choosing clusters whose error is just below the target error). Forcing the simplified clusters to share the same LOD error (https://www.notion.so/Brief-Analysis-of-Nanite-94be60f292434ba3ae62fa4bcf7d9379#bfbee7bff9294984a369fb48b1e57e25) makes the transition from the simplified clusters to the detailed version happen at the same time, so there will be no holes. The two clusters inside the blue cluster group never get a chance to be visible or invisible individually: they have the same LOD error, so they are always shown or hidden together. Genius!
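The two rules and the worked example can be reproduced with a few lines of code. This is a sketch of my simplified explanation, not of the shader: `ToyGroup`, `ShouldVisit`, `SmallEnough`, and `SelectVisible` are hypothetical, the groups are tested as a flat list instead of being reached through the BVH, the blue group's parent error of 0.9 is an assumed value, and I treat "error equal to the target" as small enough.

```cpp
#include <cassert>
#include <vector>

struct ToyGroup
{
    float MaxParentLODError;           // error of the coarser clusters that can replace this group
    std::vector<float> ClusterErrors;  // LOD error of each cluster in the group
};

// Rule 1 (cf. ShouldVisitChild): keep going only while the coarser
// replacement would exceed the target error.
bool ShouldVisit(const ToyGroup& group, float targetError)
{
    return group.MaxParentLODError > targetError;
}

// Rule 2 (cf. SmallEnoughToDraw): a cluster is drawn when its own error is acceptable.
bool SmallEnough(float clusterError, float targetError)
{
    return clusterError <= targetError;
}

// Returns, for each group, how many of its clusters end up visible.
std::vector<int> SelectVisible(const std::vector<ToyGroup>& groups, float targetError)
{
    assert(targetError >= 0.0f);
    std::vector<int> visible;
    for (const ToyGroup& group : groups)
    {
        int count = 0;
        if (ShouldVisit(group, targetError))
        {
            for (float error : group.ClusterErrors)
            {
                if (SmallEnough(error, targetError))
                    ++count;
            }
        }
        visible.push_back(count);
    }
    return visible;
}
```

With a red group {MaxParentLODError 0.7, clusters 0.1, 0.3, 0.5} and a blue group {MaxParentLODError 0.9, clusters 0.7, 0.7}, a target of 0.5 shows the three red clusters and no blue ones, and a target of 0.8 shows the two blue clusters and no red ones. The two sets are never visible together because the blue clusters' shared error equals the red group's MaxParentLODError.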

One note (thanks https://twitter.com/zouyuheng_1998): I treated the target error as a fixed number to make the explanation easier. In reality, this error is calculated from view information, so that Nanite can select a better LOD based on the view. If you want to check the details, please read the GetProjectedEdgeScales call inside the SmallEnoughToDraw function.
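To give a feeling for what "calculated from view information" can mean, here is the textbook screen-space error estimate for a perspective camera. I want to be clear that this is not what GetProjectedEdgeScales does; it is a generic sketch, and both function names and the 1 pixel default threshold are my own assumptions.

```cpp
#include <cassert>
#include <cmath>

// Approximate on-screen size, in pixels, of a world-space error seen at a
// given distance by a perspective camera. Generic estimate, not Nanite's formula.
float ProjectedErrorPixels(float worldError, float distance,
                           float verticalFovRadians, float viewportHeightPixels)
{
    assert(distance > 0.0f && verticalFovRadians > 0.0f);
    // How many pixels one world unit covers at distance 1.
    const float pixelsPerUnitAtDistanceOne =
        viewportHeightPixels / (2.0f * std::tan(verticalFovRadians * 0.5f));
    return worldError * pixelsPerUnitAtDistanceOne / distance;
}

// The same world-space error passes or fails depending on the distance.
bool SmallEnoughOnScreen(float worldError, float distance,
                         float verticalFovRadians, float viewportHeightPixels,
                         float maxPixelError = 1.0f)
{
    return ProjectedErrorPixels(worldError, distance, verticalFovRadians,
                                viewportHeightPixels) <= maxPixelError;
}
```

For a 90 degree vertical field of view and a 1080 pixel viewport, an error of 0.1 units covers about 5.4 pixels at distance 10 (too coarse) but only about 0.54 pixels at distance 100 (small enough). This is why the same cluster can be acceptable far away and replaced by a more detailed one up close.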

## In the End

I planned to discuss more details of Nanite based on the source code, but I want to split these articles into many parts (like Nanite). I need more time to read Nanite's shader code.

The design of Nanite is great work, and the ideas behind it never stop amazing me. Many thanks to Epic for developing this great system and opening the source code so that everyone can learn from Nanite.

And many thanks to everyone who is reading, discussing, and sharing ideas based on analyses of Nanite. I have learned a lot from you too.

See you again!