Managed primitives Part I

Today’s topic is managed primitive types, such as T[] arrays, strings, and objects, and how they differ across runtimes.

Why?

It helps you understand how much memory you actually use when allocating a string or an array. It also gives you a better understanding of the underlying structure of the runtime.

One of the best benefits, though: you can forge your own objects and/or pass them between managed and unmanaged code without marshalling and with zero overhead.

The source code can be found on GitHub.

System.Object

The first and most notable difference is that in “real” .NET, a managed reference points 1 * IntPtr.Size bytes into the object header, whereas in Mono/IL2CPP it points at the very start of the header (offset 0).

Let’s take a look at the object header:

typedef struct{
  void *sync;
  // dotnet object pointer starts here (sizeof(void*))
  void *vtable;
} DotnetObject;

typedef struct{
  // Mono object pointer starts here (0)
  void *vtable;
  void *sync;
} MonoObject;

So, every managed reference object has a vtable and synchronization data (structs are a separate topic and are not covered here). The synchronization data is used for implementing thin locks and for storing implementation-defined data, such as a hash code (sometimes). It varies significantly across runtimes, so it’s better not to touch it.
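To make the pointer offset concrete, here is a minimal C sketch. It reuses the two header structs from above; the helper names `dotnet_ref_from_header` and `mono_ref_from_header` are hypothetical and only illustrate where a managed reference would point relative to the start of the header.

```c
#include <assert.h>
#include <stddef.h>

/* Header layouts as described in the text (illustrative, not runtime headers). */
typedef struct {
  void *sync;   /* sits before the address a .NET reference holds */
  void *vtable; /* a .NET reference points here */
} DotnetObject;

typedef struct {
  void *vtable; /* a Mono/IL2CPP reference points here (offset 0) */
  void *sync;
} MonoObject;

/* Hypothetical helper: header start -> address a .NET reference would hold. */
static void *dotnet_ref_from_header(DotnetObject *header) {
  assert(header != NULL);
  return (char *)header + offsetof(DotnetObject, vtable);
}

/* Hypothetical helper: header start -> address a Mono reference would hold. */
static void *mono_ref_from_header(MonoObject *header) {
  assert(header != NULL);
  return (char *)header + offsetof(MonoObject, vtable);
}
```

In both models the reference ends up pointing at the vtable pointer; only the position of the sync word relative to it differs.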

The VTable is similar to a C++ vtable. It’s mostly used for storing pointers to the type’s methods, static variables, and so on. It also stores pointers to basic functions such as GetHashCode, ToString, and GetType.

Storing GetType there means that if we replace the vtable of the object, then the CLR will treat it as an object of a different type.
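As a toy model of that idea (not the CLR’s actual data structures), the C sketch below treats the first word of an object as a pointer to a type descriptor. `ToyVTable`, `ToyObject`, `toy_get_type`, and `toy_retype` are all hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a runtime vtable: it only carries a type name. */
typedef struct {
  const char *type_name;
} ToyVTable;

/* Toy object: the vtable pointer is the first word, followed by field data. */
typedef struct {
  const ToyVTable *vtable;
  int payload;
} ToyObject;

/* "GetType" in this model just follows the vtable pointer. */
static const char *toy_get_type(const ToyObject *obj) {
  assert(obj != NULL && obj->vtable != NULL);
  return obj->vtable->type_name;
}

/* Overwriting the vtable pointer changes what the object claims to be.
   The field data is left untouched. */
static void toy_retype(ToyObject *obj, const ToyVTable *new_vtable) {
  assert(obj != NULL && new_vtable != NULL);
  obj->vtable = new_vtable;
}
```

The object’s bytes stay the same after `toy_retype`; only the answer to “what type is this?” changes, which is why the old and new types need compatible field layouts.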

Be careful though

Sometimes the JIT devirtualizes method calls when it knows the exact type of the object, which lets it optimize the program better. If the JIT devirtualizes a call on your object, your vtable change won’t be reflected there, because the ‘callvirt’ instruction has been replaced with a direct call to a specific method.

After the header come the object’s fields. With non-GC blittable types it’s relatively straightforward:
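The difference between the two kinds of call can be sketched in plain C. This is a toy model with hypothetical names (`Shape`, `sides_virtual`, `sides_devirtualized_as_triangle`), not what the JIT actually emits: one function looks up its target through the vtable, the other has the target baked in.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Shape Shape;

typedef struct {
  int (*sides)(const Shape *self); /* a single virtual slot */
} ShapeVTable;

struct Shape {
  const ShapeVTable *vtable;
};

static int triangle_sides(const Shape *self) { (void)self; return 3; }
static int square_sides(const Shape *self) { (void)self; return 4; }

static const ShapeVTable triangle_vtable = {triangle_sides};
static const ShapeVTable square_vtable = {square_sides};

/* Virtual call: the target is looked up through the vtable on every call. */
static int sides_virtual(const Shape *s) {
  assert(s != NULL && s->vtable != NULL);
  return s->vtable->sides(s);
}

/* "Devirtualized" call: the target was fixed when this code was generated,
   so the vtable pointer is never consulted. */
static int sides_devirtualized_as_triangle(const Shape *s) {
  return triangle_sides(s);
}
```

After swapping `s.vtable` from `triangle_vtable` to `square_vtable`, `sides_virtual` returns 4 while `sides_devirtualized_as_triangle` still returns 3.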

[StructLayout(LayoutKind.Sequential)]
class Object2{
  public int a,b;
}

[StructLayout(LayoutKind.Sequential)]
class Object3 : Object2{
  public int c,d;
}

//would be the same as those structs in C
typedef struct{
  MonoObject header;
  int a,b;
} Object2;

typedef struct{
  MonoObject header;
  int a,b,c,d;
} Object3;

As we can see, the layout is the same.

But when we add “a bit of spice”, that is, GC fields or non-blittable types (recent .NET versions can lay out bool and char as blittable types, but .NET Framework can’t), the struct/class falls back to auto layout. Field layout then becomes runtime-dependent, meaning that we can no longer rely on a specific order.

An example:
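One practical consequence is that the base-class fields sit at the same offsets in the derived object. Here is a small C sketch that checks this, assuming the Mono-style structs above; `read_int_at` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  void *vtable;
  void *sync;
} MonoObject;

typedef struct {
  MonoObject header;
  int a, b;
} Object2;

typedef struct {
  MonoObject header;
  int a, b, c, d;
} Object3;

/* Hypothetical helper: reads an int at a byte offset from the object start.
   memcpy keeps the read well-defined regardless of the pointer's static type. */
static int read_int_at(const void *obj, size_t offset) {
  int value;
  assert(obj != NULL);
  memcpy(&value, (const char *)obj + offset, sizeof value);
  return value;
}
```

Because `offsetof(Object2, a)` equals `offsetof(Object3, a)` (and likewise for `b`), native code written against `Object2` can read those fields from an `Object3` instance.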

// This attribute would be ignored.
[StructLayout(LayoutKind.Sequential)]
class Test{
  public int a;
  public int b;
  public string c;
}

class Test2 : Test{
  public int d;
  public string e;
}

//on .NET 8.0 on x64 would be laid out as:
typedef struct{
  DotnetObject header;
  void* c;
  int a, b; //or b and then a, it's not guaranteed.
}Test;

typedef struct{
  DotnetObject header;
  void* c;
  int a, b; //or b and then a, it's not guaranteed.
  void* e;
  int d;
  int __pad;//if padding to 8 bytes enabled, runtime dependent.
}Test2;

So, when there are GC fields in the object, sequential layout is no longer guaranteed. The same applies to structs as well.

However, when your object doesn’t have GC or non-blittable fields, the CLR will lay out the object in the same order as it was declared.

T[] arrays

An array is basically a block of memory with a header: the general object header, then the length of the array, followed by the data, whose start is aligned to 8 bytes. In other words, an array doesn’t store a pointer to its data; the data simply follows the header within the same allocation.

typedef struct{
  MonoObject header;
  MonoArrayBounds *bounds; // null for "fixed" arrays
  int64_t length; // OR int32_t, depends on mono build configuration
  int64_t data[0]; // the actual data type is not int64_t, this hint is used for proper alignment.
} MonoArray;

Common way to get the data offset:

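var array = new T[1];
// Read the reference stored in the local: this is the address of the array object itself.
void *arrayPtr = *(void**)Unsafe.AsPointer(ref array);
// Address of the first element. The array is not pinned, so a GC move between these two reads would skew the result.
void *elementPtr = Unsafe.AsPointer(ref array[0]);

// Distance in bytes from the object pointer to the first element.
int dataOffset = (int)((byte*)elementPtr - (byte*)arrayPtr);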

This offset is runtime-dependent, but it is the same for all array types and across executions on the same platform, so you can calculate it once and reuse it. You can even assume that it stays the same for a given runtime, version, and architecture.

You can also describe the array as a struct, but you’ll have to define it separately for each runtime, since the layout varies between Mono, .NET FX, and .NET Core.
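On the native side, the same number falls out of a struct definition via `offsetof`. The sketch below assumes the Mono layout shown earlier with a 32-bit length; `MonoArraySketch`, `array_data_offset`, and `array_element` are hypothetical names, and a real build may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  void *vtable;
  void *sync;
} MonoObject;

/* Sketch of the Mono array layout from the text (32-bit length assumed). */
typedef struct {
  MonoObject header;
  void *bounds;    /* NULL for plain T[] arrays */
  uint32_t length;
  int64_t data[];  /* int64_t only forces 8-byte alignment of the payload */
} MonoArraySketch;

/* Native counterpart of the C# subtraction: offset of the first element. */
static size_t array_data_offset(void) {
  return offsetof(MonoArraySketch, data);
}

/* Address of element 'index' for elements of 'elem_size' bytes. */
static void *array_element(void *array_ref, size_t index, size_t elem_size) {
  assert(array_ref != NULL);
  return (char *)array_ref + array_data_offset() + index * elem_size;
}
```

Comparing `array_data_offset()` with the value measured from C# is a cheap sanity check that a hand-written struct really matches the runtime you are targeting.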

Obtaining array length:

void *elementPtr = (void*)Unsafe.AsPointer(ref array[0]);
int *dataLength = (int*)((byte*)elementPtr - IntPtr.Size);

Now that you have a pointer to the array length, you can change the length or read it from unmanaged code, depending on your use case.
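From unmanaged code, the same trick looks roughly like the C sketch below. It assumes that the length is a 32-bit value (or that the machine is little-endian and the length fits in 32 bits) stored one pointer-size before the first element, as described above; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads the 32-bit length stored one pointer-size before the first element. */
static int32_t length_from_data(const void *first_element) {
  int32_t length;
  assert(first_element != NULL);
  memcpy(&length, (const char *)first_element - sizeof(void *), sizeof length);
  return length;
}

/* Overwrites that length word, e.g. to make the array look shorter. */
static void set_length_from_data(void *first_element, int32_t new_length) {
  assert(first_element != NULL && new_length >= 0);
  memcpy((char *)first_element - sizeof(void *), &new_length, sizeof new_length);
}
```

Shrinking the length this way keeps every access inside the original allocation; growing it past the allocated size would let code read and write memory the array does not own.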

Additionally, with this code, you can create your own array without GC tracking (which will be demonstrated in the Part II article).

This is a cross-runtime way of obtaining the length and the data pointer, so it is safe to use in IL2CPP, Mono, .NET FX, and .NET Core.

You can also change the array’s element type by replacing the object’s VTable. If you’re creating your own array object in C# or C++, you need to set the correct array type yourself (for example, by copying the VTable from ‘Array.Empty<T>()’).

String

A string is similar to an array in terms of storage: it is an object with a header followed by the string length. The characters are stored right after the length as UTF-16 code units (2 bytes each, like wchar_t on Windows). The length is counted in characters (UTF-16 code units), not bytes.

There’s also an intrinsic for getting the offset of the string data:
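As a rough C sketch of that layout, the two structs below model a Mono-style string and a .NET-style string as seen from the object reference. Both struct definitions and `string_length` are illustrative assumptions rather than copies of runtime headers; the point is that the length word sits directly before the first character in either model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mono-style sketch: two-word header, then length, then characters. */
typedef struct {
  void *vtable;
  void *sync;
  int32_t length;   /* number of UTF-16 code units, not bytes */
  uint16_t chars[]; /* characters follow the length directly */
} MonoStringSketch;

/* .NET-style sketch, starting at the address the reference holds. */
typedef struct {
  void *vtable;
  int32_t length;
  uint16_t chars[];
} DotnetStringSketch;

/* Given an object reference and that runtime's offset to the character data,
   the length is the 32-bit word right before the first character. */
static int32_t string_length(const void *string_ref, size_t offset_to_data) {
  int32_t length;
  assert(string_ref != NULL && offset_to_data >= sizeof(int32_t));
  memcpy(&length, (const char *)string_ref + offset_to_data - sizeof(int32_t),
         sizeof length);
  return length;
}
```

Deriving the length address from the data offset, instead of hard-coding a header size, is what lets one formula cover both layouts.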

RuntimeHelpers.OffsetToStringData

string myString = "Hello there!";
void *stringPtr = *(void**)Unsafe.AsPointer(ref myString);
ushort *dataPtr = (ushort*)((byte*)stringPtr + RuntimeHelpers.OffsetToStringData);
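// The 32-bit length sits directly before the character data, so derive its address
// from OffsetToStringData instead of hard-coding a header size (which differs per runtime).
int *strLength = (int*)((byte*)stringPtr + RuntimeHelpers.OffsetToStringData - sizeof(int));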

// prints Length: 12
Console.WriteLine($"Length: {*strLength}");

// prints First symbol: H
Console.WriteLine($"First symbol: {(char)(dataPtr[0])}");

With this approach you can also forge a string object from native land (this will be shown in the Part II article).

List<T>

Lists are different because in some runtimes they are implemented in C# rather than in native code. The common scheme is simple, though: a list is a class object (so it has a header) that holds a pointer to a separate array object (as opposed to an array, which keeps its data right after the header), plus a length (Count) and a version.
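That scheme can be sketched in C as follows. Every name here (`IntArraySketch`, `IntListSketch`, `int_array_new`, `list_get`) is hypothetical, the array struct is simplified (no bounds field), and the field order inside a real runtime is not guaranteed; the sketch only shows the extra pointer hop compared with an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  void *vtable;
  void *sync;
} MonoObject;

/* Simplified int[] stand-in: header, length, then elements inline. */
typedef struct {
  MonoObject header;
  int32_t length; /* capacity of the backing array */
  int32_t data[];
} IntArraySketch;

/* Hypothetical List<int> layout following the scheme in the text. */
typedef struct {
  MonoObject header;
  IntArraySketch *items; /* pointer to a separate array object */
  int32_t size;          /* Count: number of elements in use */
  int32_t version;       /* changes on mutation */
} IntListSketch;

/* Allocates a zeroed backing array; the caller releases it with free(). */
static IntArraySketch *int_array_new(int32_t length) {
  IntArraySketch *arr;
  assert(length >= 0);
  arr = calloc(1, sizeof *arr + (size_t)length * sizeof(int32_t));
  if (arr != NULL) arr->length = length;
  return arr;
}

/* Bounds-checked read: two hops (list -> array -> element).
   Returns 1 and stores the element in *out on success, 0 otherwise. */
static int list_get(const IntListSketch *list, int32_t index, int32_t *out) {
  assert(list != NULL && out != NULL);
  if (list->items == NULL || index < 0 || index >= list->size) return 0;
  *out = list->items->data[index];
  return 1;
}
```

Note that the bounds check uses `size` (Count), not the array’s `length`: the backing array is usually larger than the number of elements in use.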

To be continued

More on lists and unsafe usage will be covered in Part II.
