Injecting bahaviour into time critical code

Thanks!

This post was heavily inspired by infographics by Bartosz Adamczewski @badamczewski01 and wouldn’t be possible without sharplab.io by Andrey Shchekin @ashmind and BenchmarkDotNet by contributors.

Context: TimSort for .NET

So I wanted to (re)implemented TimSort for .NET. My previous attempt was relatively successful (well, it worked) but it could be better. Closure of CodePlex (that’s where it was hosted before) gave me perfect opportunity - I need to move it, but I can also rewrite it.

One of the problems with original implementation was the fact that I wanted to have all types of array-like objects (Array, List, and IList). The problem is the choice was (at least that’s what I thought at the time) implementing it 3 times, or implementing it for IList only and take the performance hit (as accessing Array though IList interface comes at the price).

In the end, it was implemented 3 times and do avoid maintaining 3 code bases the actual algorithm was in T4 template (.tt file) so 3 implementations were generated from it.

Time has shown that I knew very little then…

Insertion sort

This post is not about TimSort though (as at the time of writing it is not finished yet). This post is about injecting behaviour into algorithms while retaining original performance.

So, insertion sort is simple O(n^2) sorting algorithm, which actually has some real-life usage. It is actually irrelevant, which one I use for this post, but if I need to use any let’s use the one which is actually has some use cases: it is very fast (O(n)) for data which is nearly sorted and is used as a part of IntroSort combo (quicksort + heapsort + insertion sort). For example, IntroSort is used by .NET.

Let’s start:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
public static void Sort(int[] array, int start, int length)
{
var lo = start;
var hi = start + length;

for (var i = lo + 1; i < hi; i++)
for (var j = i; j > lo; j--)
{
if (array[j - 1] <= array[j])
break;

Swap(array, j, j - 1);
}
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void Swap(int[] array, int a, int b)
{
var swap = array[a];
array[a] = array[b];
array[b] = swap;
}

…and measure baseline implementation:

1
2
3
4
5
6
7
8
9
10
11
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.300
[Host] : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
Job-CNGZDG : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT

InvocationCount=1 UnrollFactor=1

| Method | Size | Mean | Error | StdDev |
|---------------------------- |------ |----------:|---------:|---------:|
| ArrayOfInt | 10000 | 41.73 ms | 0.030 ms | 0.025 ms | <==

Will it work for any type?

That’s the problem. It was been tailored for array of ints (int[]). There are two issues to address: accessing generic indexable data (let’s say IList<...>) and comparing items (let’s say IComparer<...>).

That’s super simple! Let’s do it!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
public static void Sort<T>(
IList<T> array, int start, int length, IComparer<T> comparer)
{
var lo = start;
var hi = start + length;

for (var i = lo + 1; i < hi; i++)
for (var j = i; j > lo; j--)
{
if (comparer.Compare(array[j - 1], array[j]) <= 0)
break;

Swap(array, j, j - 1);
}
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void Swap<T>(IList<T> array, int a, int b)
{
var swap = array[a];
array[a] = array[b];
array[b] = swap;
}

That wasn’t hard, was it? Fantastic, problem solved new implementation works for int[] as well but is completely generic. PROFIT! Let’s just measure performance and go home…

1
2
3
4
5
6
7
8
9
10
11
12
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.300
[Host] : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
Job-CNGZDG : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT

InvocationCount=1 UnrollFactor=1

| Method | Size | Mean | Error | StdDev |
|---------------------------- |------ |----------:|---------:|---------:|
| ArrayOfInt | 10000 | 41.73 ms | 0.030 ms | 0.025 ms |
| ListOfT | 10000 | 339.24 ms | 0.191 ms | 0.149 ms | <==

…or not. It is 8 times slower now.

I heard it is about virtual calls to interfaces?

Yes, it is. Simple comparison (most likely one assembler instruction) became a call to virtual method which is (relatively) very slow. What can we do though? Well, not pass interfaces but pass classes, yes? Well, it still would need to be virtual call to make behaviour polymorphic.

How to make a static call to something with polymorphic behavior? This sounds stu… Generics!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
public static void Sort<T, TIndexer, TComparer>(
TIndexer array, int start, int length, TComparer comparer)
where TIndexer: IList<T>
where TComparer: IComparer<T>
{
var lo = start;
var hi = start + length;

for (var i = lo + 1; i < hi; i++)
for (var j = i; j > lo; j--)
{
if (comparer.Compare(array[j - 1], array[j]) <= 0)
break;

Swap<T, TIndexer>(array, j, j - 1);
}
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void Swap<T, TIndexer>(TIndexer array, int a, int b)
where TIndexer: IList<T>
{
var swap = array[a];
array[a] = array[b];
array[b] = swap;
}

Now all calls are done to actual class not to interface so when we call:

1
Sort<int, int[], Comparer<int>>(_data, 0, _data.Length, Comparer<int>.Default);

it hopefully will get expanded to call actual int[] not IList<int>… Let’s measure it:

1
2
3
4
5
6
7
8
9
10
11
12
13
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.300
[Host] : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
Job-CNGZDG : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT

InvocationCount=1 UnrollFactor=1

| Method | Size | Mean | Error | StdDev |
|---------------------------- |------ |----------:|---------:|---------:|
| ArrayOfInt | 10000 | 41.73 ms | 0.030 ms | 0.025 ms |
| ListOfT | 10000 | 339.24 ms | 0.191 ms | 0.149 ms |
| ListOfTGenerics | 10000 | 339.19 ms | 0.323 ms | 0.270 ms | <==

Oh my, it is not working! It is actually completely identical and it still calls interfaces.

Does it mean that I really need to implement it multiple times, for Array, List, IList and… Span? Is there any way to implement it just once but use in different contexts?

Value types for the win!

Wait a seconds… When .NET expands generic class it does it for each value type separately, but not for reference types. That’s why even if int[] was passed as generic argument it was not generating “dedicated” implementation for int[] it was still using IList<int> interface. Let’s try to force it by using value types.

NOTE: I don’t want to use implement whole IList<T> in my struct so let’s simplify it a little bit as the only thing we actually use from IList<T> is indexed access:

1
public interface IIndexer<T> { T this[int index] { get; set; } }

so now we can implement struct which implements this interface for array:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
public readonly struct ArrayIndexer<T>: IIndexer<T>
{
private readonly T[] _array;

public ArrayIndexer(T[] array) => _array = array;

public T this[int index]
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
get => _array[index];
[MethodImpl(MethodImplOptions.AggressiveInlining)]
set => _array[index] = value;
}
}

We can (re)implement insertion sort using this interface:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
public static void Sort<T, TIndexer, TComparer>(
TIndexer array, int start, int length, TComparer comparer)
where TIndexer: IIndexer<T>
where TComparer: IComparer<T>
{
var lo = start;
var hi = start + length;

for (var i = lo + 1; i < hi; i++)
for (var j = i; j > lo; j--)
{
if (comparer.Compare(array[j - 1], array[j]) <= 0)
break;

Swap<T, TIndexer>(array, j, j - 1);
}
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void Swap<T, TIndexer>(TIndexer array, int a, int b)
where TIndexer: IIndexer<T>
{
var swap = array[a];
array[a] = array[b];
array[b] = swap;
}

(NOTE: nothing changed in implementation, just method signature is slightly different)

Seems that BenchmarkDotNet agrees with me as we definitely have improvement:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.300
[Host] : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
Job-CNGZDG : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT

InvocationCount=1 UnrollFactor=1

| Method | Size | Mean | Error | StdDev |
|---------------------------- |------ |----------:|---------:|---------:|
| ArrayOfInt | 10000 | 41.73 ms | 0.030 ms | 0.025 ms |
| ListOfT | 10000 | 339.24 ms | 0.191 ms | 0.149 ms |
| ListOfTGenerics | 10000 | 339.19 ms | 0.323 ms | 0.270 ms |
| IndexerOfT | 10000 | 65.58 ms | 0.079 ms | 0.066 ms | <==

Let’s do it for comparer as well. The only thing which we actually want from Comparer in this algorithm is <= (less or equal), so, let’s make a dedicated interface and dedicated implementation for int (because that’s what we are measuring):

1
2
3
4
5
6
7
public interface ILessOrEqual<T> { bool LeEq(in T a, in T b); }

public readonly struct IntComparer: ILessOrEqual<int>
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool LeEq(in int a, in int b) => a <= b;
}

Now we can modify implementation of algorithm to take this ILessOrEqual<T> interface:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
public static void Sort<T, TIndexer, TComparer>(
in TIndexer array, int start, int length, in TComparer comparer)
where TIndexer: IIndexer<T>
where TComparer: ILessOrEqual<T>
{
var lo = start;
var hi = start + length;

for (var i = lo + 1; i < hi; i++)
for (var j = i; j > lo; j--)
{
if (comparer.LeEq(array[j - 1], array[j]))
break;

Swap<T, TIndexer>(array, j, j - 1);
}
}

So that would be it, let’s measure:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
AMD Ryzen 5 3600, 1 CPU, 12 logical and 6 physical cores
.NET SDK=5.0.300
[Host] : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
Job-CNGZDG : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT

InvocationCount=1 UnrollFactor=1

| Method | Size | Mean | Error | StdDev |
|---------------------------- |------ |----------:|---------:|---------:|
| ArrayOfInt | 10000 | 41.73 ms | 0.030 ms | 0.025 ms |
| ListOfT | 10000 | 339.24 ms | 0.191 ms | 0.149 ms |
| ListOfTGenerics | 10000 | 339.19 ms | 0.323 ms | 0.270 ms |
| IndexerOfT | 10000 | 65.58 ms | 0.079 ms | 0.066 ms |
| IndexerOfTAndComparer | 10000 | 41.74 ms | 0.032 ms | 0.029 ms | <==

Yay! We are back to original performance, but with generic algorithm!

Definitely calling this method is little bit cumbersome right now:

1
2
Sort<int, ArrayIndexer<int>, IntComparer>(
new ArrayIndexer<int>(array), 0, array.Length, new IntComparer());

but this can be hidden behind some nice overloads and you will have few of them (for T[], List<T>, IList<T>, or even IDictionary<int, T> - why not). The story behind Span<T> is slightly different. Possible, but a little bit more complicated (pinning is required).

Conclusion

The point is it can be almost as fast as dedicated version but also as generic as fully generic version.

Please note, it will be expanded for each indexer/comparer combination so your code will take more memory when JITted, but I don’t think this might be a problem.

I understand NetFabric.Hyperlinq uses same approach with Value delegates to make LINQ a little bit more performant without losing flexibility.