The cost of Unity’s forgiveness

Sometimes I love to dig deeper than the situation requires. That’s how you find hidden pearls for the performance of your games built with Unity.

Unity is a user-friendly and error-tolerant engine, suitable both for newbies and for mature developers with tons of titles under their belt.

Unity’s core is built with C++, which is known for its high complexity and high performance. CPUs are quite smart at guessing which branches your code will take, which region of memory you’ll ask for next in a loop, and so on. C++ compilers are REALLY smart at optimizing.

But.

It’s not that hard to ruin those optimizations while pursuing a user-friendly and forgiving environment. Unity also has a mostly single-threaded API, which is much easier to maintain and easier for users to understand.

So the engine provides some safety mechanisms. A typical call checks that the object is still alive (internalPointer != null), crosses over to the native side of the engine, then checks that the function was called from the main thread and that the input object belongs to the expected managed type. For multithreaded APIs such as TransformAccess, which can be used from job worker threads, the engine may also take a lock or synchronize threads.

The glue between the engine and “user-land” (aka C#) works over Mono-style internal calls. You may notice that some API functions, such as the transform.localPosition getter, are marked with [MethodImpl(MethodImplOptions.InternalCall)]. When such a method is executed (if AOT-compiled with IL2CPP), it first checks whether there is a cached pointer to the function “UnityEngine.Transform::get_localPosition(Vector3 &output)”, then calls the function through that pointer (which points into the engine’s native code), passing the arguments, like get_localPosition(transform, &csharpVec).

Functions called through a pointer (a CALL 0x1234567 asm instruction) can’t be inlined, and neither the CPU’s instruction-level parallelism nor the compiler’s vectorizer can do much with such a call inside a tight loop. That can make a huge difference in performance.
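To make the difference concrete, here is a minimal C++ sketch, not engine code. FakeNativeObject, native_get_enabled and both counting functions are made-up names for illustration. One loop calls an accessor through a pointer the optimizer can’t see through, the other reads the field directly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a native engine object; not Unity's real layout.
struct FakeNativeObject {
    bool enabled;
};

// The "engine side" accessor. In the real engine the body lives in another
// binary, so the caller's compiler never gets to see it.
bool native_get_enabled(const FakeNativeObject* obj) {
    return obj->enabled;
}

using GetEnabledFn = bool (*)(const FakeNativeObject*);

// Indirect version: every iteration performs a call through a pointer.
// The volatile qualifier stops the optimizer from proving which function
// the pointer targets, which mimics a pointer resolved at run time.
std::size_t count_enabled_indirect(const FakeNativeObject* obj,
                                   std::size_t iterations,
                                   GetEnabledFn fn) {
    assert(fn != nullptr);
    GetEnabledFn volatile opaque = fn;
    std::size_t count = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        count += opaque(obj) ? 1 : 0;
    }
    return count;
}

// Direct version: the field access is visible, so the compiler is free to
// inline it, hoist the load out of the loop, or vectorize.
std::size_t count_enabled_direct(const FakeNativeObject* obj,
                                 std::size_t iterations) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        count += obj->enabled ? 1 : 0;
    }
    return count;
}
```

Both functions return the same number; only the amount of work the compiler is allowed to remove differs. How big the gap is depends on the compiler and the flags, so measure it yourself.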

Jump to the solution

“Are we doomed then? We can’t inline functions, can’t bypass safety checks, so it seems there is no chance to optimize any further,” you might ask.

No. We are not. There is plenty of room for optimization.

Let’s start with a little code:

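[SerializeField]
private Behaviour behaviour;
private bool[] enableStates;

// in Start():
enableStates = new bool[10_000];

...

// in Update():
var t = behaviour; // copy the serialized reference into a local
for (int i = 0; i < 10_000; i++)
{
    // one call into the native side of the engine per iteration
    enableStates[i] = t.enabled;
}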

This code does a really simple thing: it takes 1 (one, single) behaviour, checks whether it is enabled, and writes the result to an array. It’s a single behaviour for a reason: we want to measure the cost of a call to the native side, going through all the safety measures, just to get exactly 1 byte from a native object.

This code takes 250µs (averaged over 500 frames, IL2CPP Master, Ryzen 7 3700X, Windows … whocarestho).

Let’s build our project with IL2CPP (enable “Create Visual Studio Solution” in Build Settings), open the solution, then Assembly-CSharp.cpp, and locate the Update function and the code for the loop iteration:

// enableStates[i] = t.enabled;
BooleanU5BU5D_tD317D27C31DB892BE79FAE3AEBC0B3FFB73DE9B4* L_3 = __this->___enableStates_6;
int32_t L_4 = V_1;
Behaviour_t01970CFBBA658497AE30F311C447DB0440BAB7FA* L_5 = V_0;
NullCheck(L_5);
bool L_6;
L_6 = Behaviour_get_enabled_mAAC9F15E9EBF552217A5AE2681589CC0BFA300C1(L_5, NULL);
NullCheck(L_3);
(L_3)->SetAt(static_cast<il2cpp_array_size_t>(L_4), (bool)L_6);
// for (int i = 0; i < setsCount; i++)
int32_t L_7 = V_1;
V_1 = ((int32_t)il2cpp_codegen_add(L_7, 1));

Perfect. That’s exactly what is going on in a single loop iteration:

  1. Get an array pointer from the heap, [Type]U5BU5D_t* means array of specific type
  2. Get a behaviour from the heap (it’s serialized inside component)
  3. Check it for null
  4. Call engine function passing L_5 (behaviour) as this
    • Check against all safety mechanisms
    • Unwrap a native object
    • Get a value
    • Return the result
  5. Check an array for null
  6. Set an array element checking its bounds
  7. Increment loop index

OK, what can we shave off here? The null and array bounds checks look like good candidates for optimization.

Right? Yeah, kinda. They are tweakable with [Il2CppSetOption] (see the Unity Manual), but their performance cost is not that big: the compiler knows what’s happening, there are no external calls, and most of the time (in our case: always) those checks are correctly predicted by the CPU.

We are here to optimize something nastier: the engine’s safety checks.

Let’s dive into Behaviour_get_enabled_mAAC9F15E9EBF552217A5AE2681589CC0BFA300C1:

IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR bool Behaviour_get_enabled_mAAC9F15E9EBF552217A5AE2681589CC0BFA300C1 (Behaviour_t01970CFBBA658497AE30F311C447DB0440BAB7FA* __this, const RuntimeMethod* method) 
{
	typedef bool (*Behaviour_get_enabled_mAAC9F15E9EBF552217A5AE2681589CC0BFA300C1_ftn) (Behaviour_t01970CFBBA658497AE30F311C447DB0440BAB7FA*);
	static Behaviour_get_enabled_mAAC9F15E9EBF552217A5AE2681589CC0BFA300C1_ftn _il2cpp_icall_func;

	if (!_il2cpp_icall_func)
		_il2cpp_icall_func = (Behaviour_get_enabled_mAAC9F15E9EBF552217A5AE2681589CC0BFA300C1_ftn)il2cpp_codegen_resolve_icall ("UnityEngine.Behaviour::get_enabled()");

	bool icallRetVal = _il2cpp_icall_func(__this);
	return icallRetVal;
}

Ooo-ok. What’s going on here is that IL2CPP mimics Mono’s internal calls: it resolves a function pointer by name once, caches it in a static, and calls through it. This function takes a managed-side (that’s important) pointer as its first (this) argument and returns a boolean.
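The resolve-once-then-call pattern can be sketched in plain C++ like this. It’s a toy model under assumptions, not IL2CPP’s implementation: resolve_icall, the lookup table, ManagedBehaviour, NativeBehaviour and g_resolveCount are all hypothetical names made up for the illustration.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the managed wrapper and the native object.
struct NativeBehaviour { bool enabled; };
struct ManagedBehaviour { void* cachedPtr; };

using GenericFn = void (*)();

// Counts lookups so we can see that resolution happens only once.
static int g_resolveCount = 0;

// "Engine side": this is where the safety checks would live.
bool native_get_enabled(ManagedBehaviour* self) {
    if (self == nullptr || self->cachedPtr == nullptr) {
        return false;  // the real engine would raise an error here
    }
    return static_cast<NativeBehaviour*>(self->cachedPtr)->enabled;
}

// Hypothetical resolver: maps an icall name to a function pointer.
GenericFn resolve_icall(const std::string& name) {
    static const std::unordered_map<std::string, GenericFn> table = {
        {"UnityEngine.Behaviour::get_enabled()",
         reinterpret_cast<GenericFn>(&native_get_enabled)},
    };
    ++g_resolveCount;
    auto it = table.find(name);
    return it == table.end() ? nullptr : it->second;
}

// Same shape as the generated wrapper: cache the pointer in a static,
// then make an indirect call on every invocation.
bool Behaviour_get_enabled_wrapper(ManagedBehaviour* self) {
    using Fn = bool (*)(ManagedBehaviour*);
    static Fn cached = nullptr;
    if (!cached) {
        cached = reinterpret_cast<Fn>(
            resolve_icall("UnityEngine.Behaviour::get_enabled()"));
    }
    assert(cached != nullptr);
    return cached(self);
}
```

The lookup by name is paid only on the first call. What stays on every call is the branch on the cached pointer, the indirect call, and whatever the function does on the other side.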

But hey, we can’t see any safety checks here…

Right, that’s because they are inside this function, on the engine side. We need to go deeper.

Remember: Everything is just a bunch of bytes.

C# objects are just wrappers/handles over the native ones, and Unity hides the pointer to the native object behind a private modifier. But that’s not a limit anymore (read my previous post).

For simplicity, I’ll show this inside the already generated IL2CPP project. Let’s write our own method that gets the enabled state directly from the native object:

IL2CPP_EXTERN_C IL2CPP_METHOD_ATTR bool Behaviour_get_enabled(Behaviour_t01970CFBBA658497AE30F311C447DB0440BAB7FA* __this)
{
	uint8_t* ptr = (uint8_t*)__this->___m_CachedPtr_0;
	return ptr[80] != 0;
}

Don’t forget to replace the call in the Update function with your method, then compile the solution.

The C# equivalent, using GetInternalPointer() from the previous post:
Behaviour t;

var behaviourNativePtr = (byte*)t.GetInternalPointer();
var enabled = behaviourNativePtr[80] != 0;

OK, is there some magic going on? No. We just take the pointer to the native (in-engine) object from the managed object, reinterpret it as byte*, and read the byte at offset 80.
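Here is the same trick in isolation, as a small C++ sketch you can run without Unity. FakeNativeBehaviour is a made-up struct, padded so that its flag lands at offset 80. It says nothing about what the real engine object looks like; it only shows the pointer reinterpretation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical native object: 80 opaque bytes we do not interpret,
// followed by the one-byte flag we care about.
struct FakeNativeBehaviour {
    std::uint8_t opaque[80];
    std::uint8_t enabled;
};
static_assert(offsetof(FakeNativeBehaviour, enabled) == 80,
              "the flag must sit at offset 80 for this sketch");

// Offset taken from the text; valid there only for one specific build.
constexpr std::size_t kEnabledOffset = 80;

// Treat the object as a plain byte array and read a single byte.
// No liveness check, no thread check, no type check.
inline bool read_enabled(const void* nativePtr) {
    assert(nativePtr != nullptr);
    const auto* bytes = static_cast<const std::uint8_t*>(nativePtr);
    return bytes[kEnabledOffset] != 0;
}
```

The function body is a single load and a compare, which is why the compiler can inline it into the loop.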

Would that work? Yes, I’ve checked. Well, at least for Unity 2022.1.22f1, non-dev build, x64.

How long does it take now?

9.9µs, averaged over 500 frames.

10µs instead of 250! It’s 25 times faster. That’s quite a good gain for almost nothing!

This approach is applicable to other native stuff, like gathering a transform’s parents, reading game object names, getting access to meshes’ internal data buffers (to avoid copying them, or to modify them in place), getting the local position/rotation/scale of transforms, etc. It’s quite a broad topic to investigate, but it does take time to figure out the offsets and data types.
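One way to narrow down an offset, sketched here under assumptions: snapshot the object’s bytes, flip the value, snapshot again, and keep only the offsets that changed every time. The helpers below (changed_offsets, keep_common) are hypothetical names and work on plain byte buffers; how you capture the snapshots from a running process is up to you.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Returns, in ascending order, every offset whose byte differs
// between two equally sized snapshots.
std::vector<std::size_t> changed_offsets(const std::vector<std::uint8_t>& before,
                                         const std::vector<std::uint8_t>& after) {
    assert(before.size() == after.size());
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < before.size(); ++i) {
        if (before[i] != after[i]) {
            result.push_back(i);
        }
    }
    return result;
}

// Keeps only the offsets present in both (sorted) candidate lists.
// Repeating the toggle and intersecting filters out unrelated noise
// such as counters or timestamps that happen to change in between.
std::vector<std::size_t> keep_common(const std::vector<std::size_t>& a,
                                     const std::vector<std::size_t>& b) {
    std::vector<std::size_t> result;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(result));
    return result;
}
```

If a single offset survives a few rounds, that’s your candidate. Then you still have to guess the data type and check it against the values the regular API returns.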

Offsets within the same component can differ depending on:

  1. Whether it is running in the editor or not
  2. Unity version (most of the time between major releases: 2019, 2020, etc.)
  3. Architecture (x86/x64) (always)
  4. Whether it is a dev build or not (sometimes)
  5. The target platform (Android, Windows, iOS, etc.) (almost never)
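Because of that, it’s safer to tie each offset to the exact configuration it was verified on and fall back to the regular API everywhere else. A minimal C++ sketch, assuming a hypothetical BuildConfig description; the only entry in it is the one configuration verified in this post, and everything else is deliberately left unknown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical description of the build we are running in.
struct BuildConfig {
    std::string unityVersion;
    bool isEditor;
    bool isDevelopmentBuild;
    bool is64Bit;
};

// Writes the offset and returns true only for a verified configuration.
bool try_get_enabled_offset(const BuildConfig& config, std::size_t& offsetOut) {
    // The one combination checked in the text: 2022.1.22f1, non-dev, x64.
    if (config.unityVersion == "2022.1.22f1" && !config.isEditor &&
        !config.isDevelopmentBuild && config.is64Bit) {
        offsetOut = 80;
        return true;
    }
    return false;  // unknown layout: do not guess
}

// Reads the flag directly when the layout is known; otherwise returns
// the value the caller obtained through the regular, checked API.
bool read_enabled_or(const void* nativePtr, const BuildConfig& config,
                     bool safeValue) {
    assert(nativePtr != nullptr);
    std::size_t offset = 0;
    if (!try_get_enabled_offset(config, offset)) {
        return safeValue;
    }
    return static_cast<const unsigned char*>(nativePtr)[offset] != 0;
}
```

In a real project you would resolve the offset once at startup instead of on every read, so the hot loop stays a single load.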

But all of those differences are worth the optimization for complex algorithms, tight-loop access, gathering/preparing data, etc.

And since the checks are now lifted, you are no longer restricted to a single thread. Yes, you can use this from Burst, C# threads, etc. Just be sure that the GC won’t collect your managed objects and that you don’t destroy them while your calculations are running. Otherwise it’s a straight road to a crash.

See you next time!

Cheers.

2 thoughts to “The cost of Unity’s forgiveness”

    1. Usually you can either use Ghidra (because there are PDBs, but I didn’t tell you that :))) or the more classic and sometimes more robust approach with Cheat Engine. After you get the internal pointer, jump to that location in CE and change values from the editor, watching how and where they change.
