Is .NET 8 performant enough?

While experimenting with my own execution environment for C++/C# interop, I became curious about the performance of .NET 8 (under CoreCLR, not Mono). Maybe I can skip Burst entirely and just cross the boundary via P/Invoke?
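For context, this is the kind of boundary I have in mind. A minimal P/Invoke sketch, where the library name and exported function are purely hypothetical and not part of the benchmark code:

using System.Runtime.InteropServices;

internal static class NativeBridge
{
    // Hypothetical entry point exported from the C++ side of the interop.
    [DllImport("my_native_module", CallingConvention = CallingConvention.Cdecl)]
    public static extern void RunNativeStep(float deltaTime);
}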

Test suite

Thankfully there’s already a test bench for Burst: https://github.com/nxrighthere/BurstBenchmarks

As we can see there, .NET Core 3.1’s RyuJIT is close enough to Burst, except for some heavy benchmarks like the Pixar Raytracer.

.NET 8 release notes state that there are many performance improvements. Should we test it?

Of course we should! While running the tests I discovered that Burst has also improved quite a bit, but that’s not the topic for today.
I also won’t compare against IL2CPP and Mono JIT, because they are obvious outliers, or against GCC/Clang, because that requires compiling C++.

Testing environment:
Ryzen 7 3700X
Dual-channel DDR4-2400 RAM (2x16 GB)
Windows 11 22621.2338
Unity 2023.3.3f1
Burst 1.8.8
.NET 8.0.100-rc.1.23463.5
Name / Ticks          | Burst     | .NET Core 3.1 | Ratio
Fibonacci             | 54436007  | 59585552      | 0.91
Mandelbrot            | 18415897  | 23077983      | 0.80
NBody                 | 58811032  | 76383602      | 0.76
Sieve of Eratosthenes | 18915145  | 23110182      | 0.82
Pixar Raytracer       | 183511766 | 486419618     | 0.38
Fireflies Flocking    | 54675877  | 137409462     | 0.40
Polynomials           | 18634000  | 42546783      | 0.43
Particle Kinematics   | 51025754  | 125781180     | 0.40
Arcfour               | 87181726  | 103477418     | 0.84
Seahash               | 205824138 | 128275313     | 1.60
Radix                 | 58589890  | 61706631      | 0.95

As we can see, Burst is faster in most cases, and by how much varies too widely to average. In the vectorizable tests Burst shines and beats the soul out of .NET Core 3.1, but it loses badly in the integer-math tests (Seahash).

Upgrading to .NET 8 (well, just switching the runtime version in Rider)

Name / Ticks          | Burst     | .NET 8.0  | Burst Ratio | 3.1 Ratio
Fibonacci             | 54436007  | 61106279  | 0.89        | 0.97
Mandelbrot            | 18415897  | 23280421  | 0.80        | 1.00
NBody                 | 58811032  | 76712782  | 0.76        | 1.00
Sieve of Eratosthenes | 18915145  | 32484755  | 0.58        | 0.71
Pixar Raytracer       | 183511766 | 120394846 | 1.52        | 4.00
Fireflies Flocking    | 54675877  | 111650793 | 0.48        | 1.23
Polynomials           | 18634000  | 26005384  | 0.71        | 1.63
Particle Kinematics   | 51025754  | 127686529 | 0.40        | 0.99
Arcfour               | 87181726  | 103320131 | 0.84        | 1.00
Seahash               | 205824138 | 127394971 | 1.60        | 1.00
Radix                 | 58589890  | 60683769  | 0.95        | 1.00

.NET 8 is greatly improved in float-math-heavy tests with difficult branching, like Pixar Raytracer. It’s FOUR TIMES faster than .NET Core 3.1, and 1.5 times faster than what Burst provides.
At least nothing got worse, except for the Sieve of Eratosthenes, which for some reason is 15-30% slower than on .NET Core 3.1.

Burst still shines where auto-vectorization is a big win, like Particle Kinematics.
But what if we give our JIT a little hint? The Particle Kinematics test looks like an obvious case for SIMD, which RyuJIT does not yet apply automatically.

// Original code
private struct Particle
{
    // Note it's floats, not Vector3s/float3s
    public float x, y, z, vx, vy, vz;
}

// ...

for (uint a = 0; a < iterations; ++a)
{
    for (uint b = 0, c = quantity; b < c; ++b)
    {
        Particle* p = &particles[b];

        p->x += p->vx;
        p->y += p->vy;
        p->z += p->vz;
    }
}

First of all, let’s apply the most obvious correction: switch to the SIMD-enabled System.Numerics.Vector3.

private struct Particle
{
    public Vector3 pos, velocity;
}

// Corrected calculation code

// ...

for (uint a = 0; a < iterations; ++a)
{
    for (uint b = 0, c = quantity; b < c; ++b)
    {
        Particle* p = &particles[b];
        p->pos += p->velocity;
    }
}

Name / Ticks        | Burst    | .NET 8 With Help | Burst Ratio | .NET 8.0 Ratio
Particle Kinematics | 51025754 | 100510745        | 0.50        | 0.80

As we can see, this small change helps the compiler a lot (20% faster than the original .NET 8 version), but Burst is still twice as fast!

Machine-aligned vectors

Since we are using Vector3, which doesn’t fill the machine’s 128-bit vector registers, let’s try to fix that and see if it helps.

private struct Particle
{
    // Vector4 here; replaced with float4 in the Burst version
    public Vector4 pos, velocity;
}
Name / Ticks        | Burst vec4 | .NET 8 vec4 | Burst Ratio | .NET 8.0 Ratio
Particle Kinematics | 30433174   | 42636315    | 0.71        | 0.33

As we can see, this makes our code three times faster than the original .NET 8 version, and even faster than the original (float-based) Burst code. It’s now roughly 30% slower than Burst’s optimized version.

Can we help .NET a little bit more?


for (uint a = 0; a < iterations; a++) // was ++a
{
    Particle *pt = particles; // init pointer here

    //for (uint b = 0, c = quantity; b < c; ++b)
    // Remove c, don't confuse the JIT	
    for (uint b = 0; b < quantity; b++) // was ++b
    {
        //Particle* p = &particles[b];
        pt->pos += pt->velocity;
        pt++; // manually update pointer instead.
    }
}
Name / Ticks        | Burst vec4 | .NET 8 vec4 | Burst Ratio | .NET 8.0 Ratio
Particle Kinematics | 30433174   | 32350603    | 0.94        | 0.25

Now we are only 5% slower than Burst.

A word on NativeAOT

In most of the tests NativeAOT shows performance similar to the JIT. The only tests that benefit are Fibonacci (0.72x), Fireflies Flocking (0.80x), and Particle Kinematics after the manual Vector3 addition, which gives us a glorious 0.66x ratio!

Conclusion

In my opinion, .NET 8’s JIT is mature enough to use in performance-critical scenarios, especially since you get full .NET support rather than a subset like Burst (a restriction that lets Burst produce faster code).
It’s also far more mature than .NET 7, which isn’t shown here; according to my tests, .NET 8 is much faster than .NET 7 on these same benchmarks.

Even when using JIT, and not AOT, it seems like an awesome environment to use!

But you need to help your JIT

At least a little bit. It’s not yet capable of auto-vectorization, so in scenarios where SIMD is a big win, you’d better use the intrinsic vector types from the System.Numerics package.
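To make that concrete, here is a minimal sketch of what such a hint looks like, independent of the benchmark code (the Integrate name and the Span-based signature are just for illustration):

using System;
using System.Numerics;

static class SimdHint
{
    // Writing the math in terms of Vector4 lets the JIT emit 128-bit vector
    // loads, adds, and stores instead of four scalar float operations per element.
    public static void Integrate(Span<Vector4> positions, ReadOnlySpan<Vector4> velocities)
    {
        for (int i = 0; i < positions.Length; i++)
            positions[i] += velocities[i];
    }
}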

A little side note on why Burst is faster in the Vector3 scenario

// The code produced by .NET 8 with Vector3
vmovss xmm0, [r8+8]
vmovsd xmm1, [r8]
vshufps xmm1, xmm1, xmm0, 0x44
lea r10, [r8+0xc]
vmovss xmm0, [r10+8]
vmovsd xmm2, [r10]
vshufps xmm2, xmm2, xmm0, 0x44
vaddps xmm0, xmm1, xmm2
vmovsd [r8], xmm0
vpshufd xmm1, xmm0, 2
vmovss [r8+8], xmm1
// The code produced by Burst
vmovsd xmm0, qword ptr [rdi]
vinsertps xmm0, xmm0, dword ptr [rdi + 8], 32
vmovsd xmm1, qword ptr [rdi + 12]
vinsertps xmm1, xmm1, dword ptr [rdi + 20], 32
vaddps xmm0, xmm1, xmm0
vmovlps qword ptr [rdi], xmm0
vextractps dword ptr [rdi + 8], xmm0, 2

As we can see, not only does Burst generate less code, it’s also a bit more clever. Instead of shuffling and scalar moves, Burst builds each three-float value directly into a full vector register (effectively working with Vector4s instead of Vector3s), adds them together in one instruction, and then writes the result back as a Vector2 plus one scalar.

The same trick is entirely possible in .NET, and according to my tests it makes the code about 20% faster than the originally generated version, though still not as fast as what Burst gives us.
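For reference, here is roughly how that pattern can be written by hand with hardware intrinsics. This is a sketch of the idea rather than the exact code I benchmarked; it assumes the original six-float Particle layout, an unsafe context, and an SSE-capable x64 CPU. Note that the velocity load reads one float into the next particle, so the last element needs padding or a scalar fallback:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Inside the inner loop, with p pointing at the current Particle { x, y, z, vx, vy, vz }.
Vector128<float> pos = Sse.LoadVector128(&p->x);   // x, y, z, vx  (the vx lane is ignored)
Vector128<float> vel = Sse.LoadVector128(&p->vx);  // vx, vy, vz, next particle's x
Vector128<float> sum = Sse.Add(pos, vel);

Sse.StoreLow(&p->x, sum);   // write the new x and y
p->z = sum.GetElement(2);   // write the new z, leaving vx untouched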

Be aware, though:
private struct Particle
{
    public float x, y, z, w;
}

Particle* p = /* ... some particle ... */;
float* f = (float*)p;
var myVec = new Vector4(f[0], f[1], f[2], f[3]);

While this looks exactly like loading a Vector4 via *(Vector4*)(p), it will absolutely kill your performance! The JIT cannot see that we are loading a whole vector, so it loads it float by float.
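For completeness, the fast path is the reinterpretation mentioned above. A tiny sketch, using the same four-float Particle from the snippet:

// One 128-bit load: the JIT treats Vector4 as a SIMD type and emits a single
// vector move instead of four scalar loads.
Vector4 myVec = *(Vector4*)p;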
