Is .NET 8 performant enough?

While experimenting with my own execution environment for C++/C# interop, I became curious about the performance of .NET 8 (under CoreCLR, not Mono). Maybe I can skip Burst entirely and just cross the boundary via P/Invoke?
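For context, this is the kind of boundary I have in mind. A minimal P/Invoke sketch, where the library name and exported function are purely hypothetical and not part of the benchmark code:

using System.Runtime.InteropServices;

internal static class NativeBridge
{
    // Hypothetical entry point exported from the C++ side of the interop.
    [DllImport("my_native_module", CallingConvention = CallingConvention.Cdecl)]
    public static extern void RunNativeStep(float deltaTime);
}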

Test suite

Thankfully there’s already a test bench for Burst: https://github.com/nxrighthere/BurstBenchmarks

As we can see there, .NET Core 3.1’s RyuJIT is close enough to Burst, except for some heavy benchmarks like the Pixar Raytracer.

.NET 8 release notes state that there are many performance improvements. Should we test it?

Of course we should! While running the tests I discovered that Burst has also improved quite a bit, but that’s not the topic for today.
I also won’t compare against IL2CPP and Mono JIT, because they are obvious outliers, or against GCC/Clang, because that requires compiling C++.

Testing environment:
Ryzen 7 3700X
Dual-channel DDR4-2400 RAM (2x16 GB)
Windows 11 22621.2338
Unity 2023.3.3f1
Burst 1.8.8
.NET 8.0.100-rc.1.23463.5
Name / Ticks          | Burst     | .NET Core 3.1 | Ratio
Fibonacci             | 54436007  | 59585552      | 0.91
Mandelbrot            | 18415897  | 23077983      | 0.80
NBody                 | 58811032  | 76383602      | 0.76
Sieve of Eratosthenes | 18915145  | 23110182      | 0.82
Pixar Raytracer       | 183511766 | 486419618     | 0.38
Fireflies Flocking    | 54675877  | 137409462     | 0.40
Polynomials           | 18634000  | 42546783      | 0.43
Particle Kinematics   | 51025754  | 125781180     | 0.40
Arcfour               | 87181726  | 103477418     | 0.84
Seahash               | 205824138 | 128275313     | 1.60
Radix                 | 58589890  | 61706631      | 0.95

As we can see, Burst is faster in most cases, and by how much varies too widely to average. In the vectorizable tests Burst shines and beats the soul out of .NET Core 3.1, but it loses badly in the integer-math tests (Seahash).

Upgrading to .NET 8 (well, just switching the runtime version in Rider)

Name / Ticks          | Burst     | .NET 8.0  | Burst Ratio | 3.1 Ratio
Fibonacci             | 54436007  | 61106279  | 0.89        | 0.97
Mandelbrot            | 18415897  | 23280421  | 0.80        | 1.00
NBody                 | 58811032  | 76712782  | 0.76        | 1.00
Sieve of Eratosthenes | 18915145  | 32484755  | 0.58        | 0.71
Pixar Raytracer       | 183511766 | 120394846 | 1.52        | 4.00
Fireflies Flocking    | 54675877  | 111650793 | 0.48        | 1.23
Polynomials           | 18634000  | 26005384  | 0.71        | 1.63
Particle Kinematics   | 51025754  | 127686529 | 0.40        | 0.99
Arcfour               | 87181726  | 103320131 | 0.84        | 1.00
Seahash               | 205824138 | 127394971 | 1.60        | 1.00
Radix                 | 58589890  | 60683769  | 0.95        | 1.00

.NET 8 is greatly improved in float-math-heavy tests with difficult branching, like Pixar Raytracer. It’s FOUR TIMES faster than .NET Core 3.1, and 1.5 times faster than what Burst provides.
At least nothing got worse, except for the Sieve of Eratosthenes, which for some reason is 15-30% slower than on .NET Core 3.1.

Burst still shines where auto-vectorization is a big win, like Particle Kinematics.
But what if we give our JIT a little hint? The Particle Kinematics test looks like an obvious case for SIMD, which RyuJIT does not yet apply automatically.

// Original code
private struct Particle
{
    // Note it's floats, not Vector3s/float3s
    public float x, y, z, vx, vy, vz;
}

// ...

for (uint a = 0; a < iterations; ++a)
{
    for (uint b = 0, c = quantity; b < c; ++b)
    {
        Particle* p = &particles[b];

        p->x += p->vx;
        p->y += p->vy;
        p->z += p->vz;
    }
}

First of all, let’s apply the most obvious correction: switch to the SIMD-enabled System.Numerics.Vector3.

private struct Particle
{
    public Vector3 pos, velocity;
}

// Corrected calculation code

// ...

for (uint a = 0; a < iterations; ++a)
{
    for (uint b = 0, c = quantity; b < c; ++b)
    {
        Particle* p = &particles[b];
        p->pos += p->velocity;
    }
}

Name / Ticks        | Burst    | .NET 8 With Help | Burst Ratio | .NET 8.0 Ratio
Particle Kinematics | 51025754 | 100510745        | 0.50        | 0.80

As we can see, this small change helps the compiler a lot (20% faster than the original .NET 8 version), but Burst is still twice as fast!

Machine-aligned vectors

Since we are using Vector3, which doesn’t fill the machine’s 128-bit vector registers, let’s try to fix that and see if it helps.

private struct Particle
{
    // Vector4 here; replaced with float4 in the Burst version
    public Vector4 pos, velocity;
}
Name / Ticks        | Burst vec4 | .NET 8 vec4 | Burst Ratio | .NET 8.0 Ratio
Particle Kinematics | 30433174   | 42636315    | 0.71        | 0.33

As we can see, this makes our code three times faster than the original .NET 8 version, and even faster than the original (float-based) Burst code. It’s now roughly 30% slower than Burst’s optimized version.

Can we help .NET a little bit more?


for (uint a = 0; a < iterations; a++) // was ++a
{
    Particle *pt = particles; // init pointer here

    //for (uint b = 0, c = quantity; b < c; ++b)
    // Remove c, don't confuse the JIT	
    for (uint b = 0; b < quantity; b++) // was ++b
    {
        //Particle* p = &particles[b];
        pt->pos += pt->velocity;
        pt++; // manually update pointer instead.
    }
}
Name / Ticks        | Burst vec4 | .NET 8 vec4 | Burst Ratio | .NET 8.0 Ratio
Particle Kinematics | 30433174   | 32350603    | 0.94        | 0.25

Now we are only 5% slower than Burst.

A word on NativeAOT

In most of the tests NativeAOT shows performance similar to the JIT. The only tests that benefit are Fibonacci (0.72x), Fireflies Flocking (0.80x), and Particle Kinematics after the manual Vector3 addition, which gives us a glorious 0.66x ratio!

Conclusion

In my opinion, .NET 8’s JIT is mature enough to use in performance-critical scenarios, especially since you get full .NET support rather than a subset like Burst (a restriction that lets Burst produce faster code).
It’s also far more mature than .NET 7, which isn’t shown here; according to my tests, .NET 8 is much faster than .NET 7 on these same benchmarks.

Even when using JIT, and not AOT, it seems like an awesome environment to use!

But you need to help your JIT

At least a little bit. It’s not yet capable of auto-vectorization, so in scenarios where SIMD is a big win, you’d better use the intrinsic vector types from the System.Numerics package.
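To make that concrete, here is a minimal sketch of what such a hint looks like, independent of the benchmark code (the Integrate name and the Span-based signature are just for illustration):

using System;
using System.Numerics;

static class SimdHint
{
    // Writing the math in terms of Vector4 lets the JIT emit 128-bit vector
    // loads, adds, and stores instead of four scalar float operations per element.
    public static void Integrate(Span<Vector4> positions, ReadOnlySpan<Vector4> velocities)
    {
        for (int i = 0; i < positions.Length; i++)
            positions[i] += velocities[i];
    }
}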

A little side note on why Burst is faster in the Vector3 scenario

// The code produced by .NET 8 with Vector3
vmovss xmm0, [r8+8]
vmovsd xmm1, [r8]
vshufps xmm1, xmm1, xmm0, 0x44
lea r10, [r8+0xc]
vmovss xmm0, [r10+8]
vmovsd xmm2, [r10]
vshufps xmm2, xmm2, xmm0, 0x44
vaddps xmm0, xmm1, xmm2
vmovsd [r8], xmm0
vpshufd xmm1, xmm0, 2
vmovss [r8+8], xmm1
// The code produced by Burst
vmovsd xmm0, qword ptr [rdi]
vinsertps xmm0, xmm0, dword ptr [rdi + 8], 32
vmovsd xmm1, qword ptr [rdi + 12]
vinsertps xmm1, xmm1, dword ptr [rdi + 20], 32
vaddps xmm0, xmm1, xmm0
vmovlps qword ptr [rdi], xmm0
vextractps dword ptr [rdi + 8], xmm0, 2

As we can see, not only does Burst generate less code, it’s also a bit more clever. Instead of shuffling and scalar moves, Burst builds each three-float value directly into a full vector register (effectively working with Vector4s instead of Vector3s), adds them together in one instruction, and then writes the result back as a Vector2 plus one scalar.

The same trick is entirely possible in .NET, and according to my tests it makes the code about 20% faster than the originally generated version, though still not as fast as what Burst gives us.
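For reference, here is roughly how that pattern can be written by hand with hardware intrinsics. This is a sketch of the idea rather than the exact code I benchmarked; it assumes the original six-float Particle layout, an unsafe context, and an SSE-capable x64 CPU. Note that the velocity load reads one float into the next particle, so the last element needs padding or a scalar fallback:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Inside the inner loop, with p pointing at the current Particle { x, y, z, vx, vy, vz }.
Vector128<float> pos = Sse.LoadVector128(&p->x);   // x, y, z, vx  (the vx lane is ignored)
Vector128<float> vel = Sse.LoadVector128(&p->vx);  // vx, vy, vz, next particle's x
Vector128<float> sum = Sse.Add(pos, vel);

Sse.StoreLow(&p->x, sum);   // write the new x and y
p->z = sum.GetElement(2);   // write the new z, leaving vx untouched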

Be aware, though:
private struct Particle
{
    public float x, y, z, w;
}

Particle* p = /* ... some particle ... */;
float* f = (float*)p;
var myVec = new Vector4(f[0], f[1], f[2], f[3]);

While this looks exactly like loading a Vector4 via *(Vector4*)(p), it will absolutely kill your performance! The JIT cannot see that we are loading a whole vector, so it loads it float by float.
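For completeness, the fast path is the reinterpretation mentioned above. A tiny sketch, using the same four-float Particle from the snippet:

// One 128-bit load: the JIT treats Vector4 as a SIMD type and emits a single
// vector move instead of four scalar loads.
Vector4 myVec = *(Vector4*)p;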
