Reference :- http://codebetter.com/blogs/gregyoung/archive/2006/06/11/146343.aspx
There are a lot of posts on the Internet discussing varying methods of looping using for loops and which perform best. These posts also generally give advice as to how you should handle your looping based upon performance metrics. Most of these types of posts suffer from an affliction I have discussed previously; they look at IL (Intermediate Language) to rationalize performance which is just flat out wrong as subtle differences can cause JIT (Just in Time) optimizations to go haywire.
In this post we will look at the unmanaged code produced by varying constructs to come up with a definitive answer to the performance question. We will also look at some other optimization methods that people claim happen but do not always happen and discuss some reasons why they may not be happening and of course set some rules for fast looping.
In trying to apply this the version of your framework is important as the JIT may vary from version to version, I do not believe anything here will change based upon the platform but that is a possibility. I used version 2.0.50727 for all examples in this document. Other versions such as Rotor, Mono, or 1.x will most likely show differing behaviors.
Hoisting?
Before I start getting into code I would like to discuss the concept of hoisting as it will be the focal point of this entire discussion. When dealing with loops we can break any given loop into three sections
1) The preamble (setting up for the loop, doing things such as setting our counter to 0)
2) The code within the loop
3) The code to increment our counter and check for the end of iteration possibly jumping back to step two
For the rest of this post I will color code the disassembled assembly as section one (green), section two (blue), section 3 (yellow) to make things easier to follow
Any code that is in step two or three will be run N times as the loop is executed. The code in step one will only be executed a single time when we first enter the loop. The concept of hoisting involves moving code out of steps two and three and into step one. Hoisting obviously gives you a huge performance gain as the code will only be run once.
Simple Hoisting Example
If you read alot articles on for loop performance they will tell you to not manually hoist because the JIT will end up doing it for you. In order to test this I have created the following test code (which can be used in conjunction with the harness discussed in the previous Performance Measurement post)
class Program { static int[] foo = new int[100000]; static int Dummy; static void Test1NoHoist() { int total = 0; for (int i = 0; i <> total|=i; } Dummy = total; } static void Test1WithHoist() { int total = 0; int length = foo.Length; for (int i = 0; i <> total|=i; } Dummy = total; } static void Main(string[] args) { TestHarness.Test("Test1NoHoist", 10000, newTestHandler(Test1NoHoist)); TestHarness.Test("Test1WithHoist", 10000, newTestHandler(Test1WithHoist)); Console.ReadLine(); } } |
Listing 1: Basic manual hoisting example |
You will quickly notice that in Test1WithHoist we manually took the check Length property on foo and moved it to before the loop; this is a common optimization if you come from a C/C++ world. The thought behind this is that since Length translates to get_Length() that we are saving ourselves from having to call the method n times (remember that this would be in the second or third section and everything in the third section is called n times).
An astute reader will notice an oddity dealing with the “Dummy” variable that is assigned after the loop. We will come back this a bit later on… good catch ;)
The by running our performance test, we can see that the code executes in roughly equivalent time.
Test | Total Time (ns) | Average (ns) |
Test1NoHoist | 820583612.35 | 82058.36 |
Test1WithHoist | 812735363.70 | 81273.53 |
Let’s take a look at the generated code to offer proof that the two bits of code should run in equivalent time (if you want to see the results for yourself remember to follow the instructions for getting JIT optimized code from Viewing Unmanaged Code in VS.NET.
Test1NoHoist | Test1WithHoist |
00000000 xor ecx,ecx 00000002 xor edx,edx 00000004 mov eax,dword ptr ds:[022B1EC4h] 00000009 mov eax,dword ptr [eax+4] 0000000c test eax,eax 0000000e jle 00000019 00000010 or ecx,edx 00000012 add edx,1 00000015 cmp eax,edx 00000017 jg 00000010 00000019 mov dword ptr ds:[00912FE8h],ecx 0000001f ret | 00000000 xor ecx,ecx 00000002 mov eax,dword ptr ds:[022B1EC4h] 00000007 mov edx,dword ptr [eax+4] 0000000a xor eax,eax 0000000c test edx,edx 0000000e jle 00000019 00000010 or ecx,eax 00000012 add eax,1 00000015 cmp eax,edx 00000017 jl 00000010 00000019 mov dword ptr ds:[00912FE8h],ecx 0000001f ret |
Listing 2: Disassembled versions of our functions |
A kind of neat item in these two bits of code is that the only real difference is that EAX and EDX are interchanged … This makes them look a bit more different than they are but rest assured they are pretty much identical. In both cases, the access to the stop variable has been hoisted out of the loop (in section three, both are comparing their counter to a register that was preloaded in section one). To better illustrate the point, here is a disassembly of the first for loop (without manual hoisting) when run without JIT optimizations (i.e. it will not be hoisted).
0000002a xor esi,esi 0000002c nop 0000002d jmp 00000034 0000002f nop 00000030 or edi,esi 00000032 nop 00000033 inc esi 00000034 mov eax,dword ptr ds:[02275A34h] ;inlined length 00000039 cmp esi,dword ptr [eax+4] 0000003c setl al 0000003f movzx eax,al 00000042 mov ebx,eax 00000044 test ebx,ebx 00000046 jne 0000002F |
Listing 3: First loop (no hoisting with optimizations disabled) non-important code removed |
If this does not convince you to NEVER release code that is not being optimized I don’t know what will. It should also give you an idea of how much work the JIT optimizer really does. What has happened here is that the property was inlined, but it was inlined into section three. The optimizer was smart enough to realize this and to move it up to the preamble. +1 for the optimizer
Non-Trivial Hoisting
Now that we have gone through a simple example, let’s try a more difficult one for the optimizer. In this example we will use a method GetUpperBound(0) which will not be inlined for us, let’s take a look at how well the JIT handles it. Here is the testing code to add to our previous code.
static void Test1NoHoistNoInlining() { int total = 0; for (int i = 0; i <> total |= i; } Dummy = total; } static void Test1WithHoistNoInlining() { int total = 0; int length = foo.GetUpperBound(0); for (int i = 0; i <> total |= i; } Dummy = total; } |
Listing 4: Tests without inlining available |
Next we should add the following to our Main to run the tests.
TestHarness.Test("Test1NoHoist", 10000, newTestHandler(Test1NoHoistNoInlining)); TestHarness.Test("Test1WithHoist", 10000, newTestHandler(Test1WithHoistNoInlining)); |
Listing 5: Calling our new methods |
When we run the tests in release mode, we get quite different results than previously. I am including the previous results in the table for comparison.
Test | Total Time (ns) | Average (ns) |
Test1NoHoist | 820583612.35 | 82058.36 |
Test1WithHoist | 812735363.70 | 81273.53 |
Test1NoHoistNoInlining | 25457518879.48 | 2545751.88 |
Test1WithHoistNnInlining | 824641533.67 | 82464.15 |
Interesting, not manually hoisting makes our function 32x slower. Before we even get into code on this one I will put my money on the optimizer not hoisting the call. I believe however there are some reasons for this that we will discuss after looking through the code.
Test1NoHoistNoInlining | Test1WithHoistNoInlining |
00000004 xor esi,esi 00000006 mov ecx,dword ptr ds:[022B1EC4h] 0000000c xor edx,edx 0000000e cmp dword ptr [ecx],ecx 00000010 call 792661F8 00000015 test eax,eax 00000017 jle 00000031 00000019 or edi,esi 0000001b add esi,1 0000001e mov ecx,dword ptr ds:[022B1EC4h] 00000024 xor edx,edx 00000026 cmp dword ptr [ecx],ecx 00000028 call 792661F8 0000002d cmp eax,esi 0000002f jg 00000019 | 00000003 mov ecx,dword ptr ds:[022B1EC4h] 00000009 xor edx,edx 0000000b cmp dword ptr [ecx],ecx 0000000d call 792661A8 00000012 mov edx,eax 00000014 xor eax,eax 00000016 test edx,edx 00000018 jle 00000023 0000001a or esi,eax 0000001c add eax,1 0000001f cmp eax,edx 00000021 jl 0000001A |
Listing 6: Disassembled versions |
As we suspected the code on the left hand side is not automatically hoisting the call of the function for us. It would seem that the JIT will not automatically hoist method calls for us, that instead we have to explicitly state that we want them hoisted on our own.
It seems that that the JIT cannot get to the point of having a value in a register or memory that it can consider its result and as such feels the need to make the call on every iteration. This would make sense in general as I may be depending upon the behavior of being called at every interval. Consider the following code.
static int t = 0; static bool KeepGoing() { Console.WriteLine("Still Going!"); t++; return t <> } static void TestShortMethod() { for (; KeepGoing();) { System.Threading.Thread.Sleep(100); } } |
Listing 7: Odd but valid code |
Obviously this is nasty code but it is still valid. This is a simplified example it appears that it was a conscious decision to not support these types of situations as they can quickly become extremely complex from the JIT point of view, my guess is that it won’t deal with anything beyond simply returning a variable which will be inlined into a simple read anyway. In my opinion, something like this should still be inlined (and left in section 3) to avoid the overhead of setting up the call on every iteration but I may be missing some other case that makes this more difficult. Based upon these results we can create the following rule.
If you wish to use a method call for your stop condition that does anything more complex than simply returning a variable or is not inlinable for other reasons such as being virtual; you should hoist it manually by placing it in the first part of your for loop in order to make it explicit to the JIT that you do not wish the behavior to be called on every iteration.
Mark Lubischer brought up an excellent point here. We can in fact also hoist the call by using for like this.
static int MarkWithoutInlining() { int total = 0; int[] length = new int[10000]; for (int i = 0, j = length.GetUpperBound(0); i <> total |= i; } return total; } |
This is a much better way of handling our hoisting as it better defines the scope of our variable while the behavior is in fact equivalent to our original hoist.
static int MarkWithoutInlining() { int total = 0; 00000000 push edi 00000001 push esi 00000002 push ebx 00000003 xor ebx,ebx int[] length = new int[10000]; 00000005 mov edx,2710h 0000000a mov ecx,7915982Ah 0000000f call FFB21D98 00000014 mov edi,eax for (int i = 0, j = length.GetUpperBound(0); i <> 00000016 xor esi,esi 00000018 mov ecx,edi 0000001a xor edx,edx 0000001c cmp dword ptr [ecx],ecx 0000001e call 792664C8 00000023 test eax,eax 00000025 jle 00000039 00000027 mov edx,dword ptr [edi+4] total |= length; 0000002a cmp esi,edx 0000002c jae 0000003F 0000002e or ebx,dword ptr [edi+esi*4+8] for (int i = 0, j = length.GetUpperBound(0); i <> 00000032 add esi,1 00000035 cmp esi,eax 00000037 jl 0000002A } return total; 00000039 mov eax,ebx 0000003b pop ebx 0000003c pop esi 0000003d pop edi 0000003e ret 0000003f call 792B42E9 00000044 int 3 |
A Bit About Foreach
Foreach is not an IL concept, it is a compiler concept. Compilers generate normal for loops when they see a foreach being used to iterate an array. If you are interested in seeing how foreach loops work I would highly recommend taking a look with ILDASM or Reflector. I will not discuss heavily foreach loops as they have at this moment no difference when they reach the native level.
Foreach could in the future offer many benefits over a general for loop. Since the foreach loop is explicitly stating that you want to iterate through the array (not allowing things like addition and subtraction to your counter), other things such as hoisting array bounds checks (which we will discuss shortly) would be much more easily accomplished. I would imagine that in the future foreach will be the preferred iteration construct.
I will assure you though that as of now foreach is not faster than the equivalent for loop when dealing with arrays, in fact both VB.NET and C# insure that they are identical. There is one case where foreach will be slower than a for loop and that is if you do not actually use the item within your loop, the for version will obviously be faster since it never loads the variable where as the foreach does this implicitly on every iteration. Foreach can however offer you great performance increases when dealing with other types that implement IEnumerable (not including collections as they simply perform an indexed lookup in their enumerator).
Logically one can come up with a great example for this when looking at a linked list. The for loop will cause ∑ n total operations on the iteration where as the foreach will only cause n. The reason for this is every index operation would have to start at the beginning of the list and iterate n nodes in order to return the nth node where as the enumerator will simply remember the last node it was on and give the next node. For this reason we will add our second rule.
When dealing with items that are enumerable as opposed to dealing with arrays or collections; prefer foreach to a for loop as the enumerator will often offer a much faster way of enumerating than using an index.
Array Bounds Check Hoisting
The check to find out whether or not we are valid to continue is not the only thing that can be hoisted from inside of a loop. In fact every time the variable is used in a comparison within the loop is an opportunity for hoisting to occur. The first example of this type of hoist we will look at is array bounds checking.
Array bounds are checked automatically by the CLR in order to prevent things like buffer overflows from occurring. Every time that you access an array in safe code you will in fact have a comparison occur to insure that you are within the range of the array. If you are not within a valid range an IndexOutOfRangeException will occur as opposed to writing happily beyond the end of your array as many other language such as C would do.
The problem with array bounds checks is that they are extremely redundant when dealing with loops. Consider the following code that shows what is happening.
static void SampleArrayBoundsCheck() { int total = 0; for (int i = 0; i <> if (i <> total |= foo[ i ]; } } } |
Listing 9: Sample Array Bounds Code |
Naturally you are not issuing these checks in your code, but this example helps to make what’s really going on a bit clearer. When looking at this code anyone who has read the first few chapters of a C# book would scratch their head and wonder why all of the redundancy has been placed into the code. It is obvious that if our counter has a constraint to stay below foo.Length that it will in fact always succeed the conditional of being below foo.Length. To test how intelligent the JIT is with handling this situation we can use the following code. Note that for this test we will simply be looking at the native code generated as opposed to measuring performance.
static int [] SampleArrayBoundsCheck() { int [] Destination = new int[foo.Length]; for (int i = 0; i <> Destination[ i ]= foo[ i ]; } return Destination; } |
Listing 10: Test to see if array bound checks hoist |
This code is simply copying the static array that we have been using previously to another array. This code should be particularly good to test as it in fact has two bounds checks occurring within it (one for each array).
static int [] SampleArrayBoundsCheck() { int [] Destination = new int[foo.Length]; 00000000 push edi 00000001 push esi 00000002 push ebx 00000003 push ebp 00000004 push eax 00000005 mov edi,dword ptr ds:[022B1EC4h] 0000000b mov ebx,dword ptr [edi+4] 0000000e mov edx,ebx 00000010 mov ecx,7915982Ah 00000015 call FFB21FC0 0000001a mov esi,eax for (int i = 0; i <> 0000001c xor edx,edx 0000001e test ebx,ebx 00000020 jle 00000045 00000022 mov ebp,dword ptr [edi+4] 00000025 mov eax,dword ptr [esi+4] 00000028 mov dword ptr [esp],eax Destination[ i ]= foo[ i ]; 0000002b cmp edx,ebp 0000002d jae 0000004D 0000002f mov ecx,dword ptr [edi+edx*4+8] 00000033 mov eax,dword ptr [esp] 00000036 cmp edx,eax 00000038 jae 0000004D 0000003a mov dword ptr [esi+edx*4+8],ecx for (int i = 0; i <> 0000003e add edx,1 00000041 cmp ebx,edx 00000043 jg 0000002B } return Destination; 00000045 mov eax,esi 00000047 pop ecx 00000048 pop ebp 00000049 pop ebx 0000004a pop esi 0000004b pop edi 0000004c ret 0000004d call 792B4511 00000052 int 3 |
Listing 11: Disassembly of simple array bound checks |
Unfortunately even this most trivial of examples does not hoist the array bounds checks. The jumps can clearly be seen on lines 2D and 38. As to why this does not work I am not sure but perhaps it is because code like this could exist which could cause a buffer overflow? Perhaps the amount of time detect this situation has been deemed too much.
static int [] SampleArrayBoundsCheck() { int [] Destination = new int[foo.Length]; for (int i = 0; i <> i += 100000000; Destination[ i ]= foo[ i ]; } return Destination; } |
Listing 12: Buffer overfow example |
Checking before applying the optimization would require a good amount of overhead and if it did not check to make sure anyone was messing with our counter in the interior of the loop then we could run into code like this which would cause a buffer overflow (who knows what we just wrote over, maybe we would get an exception or maybe we just caused a funny character in a string some place, boy do I miss the days of C/Pascal in embedded systems J). Let’s try something even simpler, Brad Abrams says it works so it must?
static int SampleArrayConstantBoundsCheck() { int foo2 = 0; for (int i = 0; i <> foo2 = foo[ i ]; } return foo2; } |
static int SampleArrayConstantBoundsCheck() { int foo2 = 0; 00000000 push esi for (int i = 0; i <> 00000001 xor edx,edx 00000003 mov ecx,dword ptr ds:[022B1EC4h] 00000009 mov esi,dword ptr [ecx+4] foo2 = foo[ i ]; 0000000c cmp edx,esi 0000000e jae 00000021 00000010 mov eax,dword ptr [ecx+edx*4+8] for (int i = 0; i <> 00000014 add edx,1 00000017 cmp edx,3E8h 0000001d jl 0000000C 0000001f pop esi } return foo2; 00000020 ret 00000021 call 792B44A9 00000026 int 3 |
Listing 13: Simplest possible example |
Again, no love from the JIT optimizer, we cannot possibly make this example any simpler but it still has the bounds checks firmly placed within our loop. The JIT optimizer does not hoist array bounds checks for you. My guess is that it will not do this do to threading reasons, I would assume the process to actually be thread safe (as it is only copying something of reference size which is assured to be atomic but I am probably missing something. There is some similar functionality which happens that I think may cause the confusion that array bounds hoist are actually occurring. Let’s take a look at this other optimization, but first we can make a new rule.
If you are dealing with array accesses in a highly performant area and the bounds checks being in the loop are too much. You will have to handle your iteration in unsafe code to remove the bounds checks.
Local Array Bounds Check Removal
The optimization that the JIT does support is removing bounds checks for locally created arrays. Since the array has been created locally and is only known by a local reference, the JIT can be sure that the array cannot possibly change. In these cases the JIT will completely remove bounds checks. Let’s take a look at an example.
static int TestLocalArrayBoundsCheckRemoval() { int [] Test = new int[10000]; int total = 0; for(int i=0;i total |= Test[ i ]; } return total; } |
static int TestLocalArrayBoundsCheckRemoval() { int[] Test = new int[10000]; 00000000 push esi 00000001 mov edx,2710h 00000006 mov ecx,7915982Ah 0000000b call FFB21FF0 00000010 mov ecx,eax int total = 0; 00000012 xor esi,esi for(int i=0;i 00000014 xor edx,edx 00000016 mov eax,dword ptr [ecx+4] 00000019 test eax,eax 0000001b jle 00000028 total |= Test[ i ]; 0000001d or esi,dword ptr [ecx+edx*4+8] for(int i=0;i 00000021 add edx,1 00000024 cmp eax,edx 00000026 jg 0000001D } return total; 00000028 mov eax,esi 0000002a pop esi 0000002b ret |
Listing 14: Removal of local array bounds |
What is really neat here is that we have in fact created unmanaged code which does not even check for out of bounds access. This code will never throw an exception (it doesn’t even have code to throw an exception). This optimization is even better than hoisting as it has no overhead, the hoisting would make it a single comparison; this had no comparison. My guess is that people were getting confused between this and hoisting. If you do not use a length property or a constant in your loop this optimization will not occur, as an example the following code will not remove the bounds checks
static int SampleArrayConstantBoundsCheckRemoval() { int foo2 = 0; int size = foo.GetUpperBound(0) ; for (int i = 0; i <> foo2 = foo[ i ]; } return foo2; } |
static int TestLocalArrayBoundsCheckRemoval() { int[] Test = new int[10000]; 00000000 push edi 00000001 push esi 00000002 mov edx,2710h 00000007 mov ecx,7915982Ah 0000000c call FFB21FF0 00000011 mov esi,eax int total = 0; 00000013 xor edi,edi int size = Test.GetUpperBound(0); 00000015 mov ecx,esi 00000017 xor edx,edx 00000019 cmp dword ptr [ecx],ecx 0000001b call 79266720 00000020 mov edx,eax for(int i=0;i 00000022 xor eax,eax 00000024 test edx,edx 00000026 jle 0000003A 00000028 mov ecx,dword ptr [esi+4] total |= Test[ i ]; 0000002b cmp eax,ecx 0000002d jae 0000003F 0000002f or edi,dword ptr [esi+eax*4+8] for(int i=0;i 00000033 add eax,1 00000036 cmp eax,edx 00000038 jl 0000002B } return total; 0000003a mov eax,edi 0000003c pop esi 0000003d pop edi 0000003e ret 0000003f call 792B4541 00000044 int 3 |
Listing 15: Manual hoisting causes Array Bounds Check Removal to fail |
As we can see the manual hoisting caused the optimization to fail. So manual hoisting is not always a good thing as it can break some JIT patterns such as this. This optimization will also work for writes.
static int [] TestLocalArrayBoundsCheckWrite() { int[] Test = new int[10000]; for(int i=0;i Test[ i ] = i; } return Test; } |
static int [] TestLocalArrayBoundsCheckWrite() { int[] Test = new int[10000]; 00000000 mov edx,2710h 00000005 mov ecx,7915982Ah 0000000a call FFB21FF0 0000000f mov ecx,eax for(int i=0;i 00000011 xor edx,edx 00000013 mov eax,dword ptr [ecx+4] 00000016 test eax,eax 00000018 jle 00000025 Test[ i ] = i; 0000001a mov dword ptr [ecx+edx*4+8],edx for(int i=0;i 0000001e add edx,1 00000021 cmp eax,edx 00000023 jg 0000001A } return Test; 00000025 mov eax,ecx 00000027 ret |
Listing 16: Bounds Check Removal for writes |
As we can see, it again removed all of the bounds checking. Of course as we have seen earlier, this will only work for an array that is created in the same method. If we move the creation of the array out of the method this optimization will no longer occur. This leads us to our third rule.
Try to keep creation and initialization of arrays within the same method as the JIT can often times remove the bounds checks during your initialization.
Another oddity
Were you one of the people I said “good catch” to earlier? Remember those odd “Dummy” variables were using in the first set of examples
The JIT is extremely smart in some optimizations. For example it realizes if you are doing calculations and never use the result, it will remove the code for you. It does however have a slight problem when dealing with loops. Let’s try removing those Dummy calls from our previous code and see what happens. Let’s take a look at only one of them since we have already shown that they produce identical results.
static void Test1NoHoist() { int total = 0; for (int i = 0; i <> total|=i; } } |
static void Test1NoHoist() { int total = 0; 00000000 xor edx,edx for (int i = 0; i <> 00000002 mov eax,dword ptr ds:[022B1EC4h] 00000007 mov eax,dword ptr [eax+4] 0000000a test eax,eax 0000000c jle 00000015 0000000e add edx,1 00000011 cmp eax,edx 00000013 jg 0000000E 00000015 ret |
Figure 17: Our method without the dummy value being set |
I have color coded the example as with the rest of the example. You will notice that the blue section is missing. The JIT optimizer has correctly realized that we were in fact never using our total variable after we performed all of the calculation on it. It has however failed to realize that by taking out our code, it has in fact left a loop which does nothing. Nothing super interesting here, this is just something I came across while writing this.
Summary
We have looked through quite a few JIT optimizations in this post and have made a few rules which that can help us in situations where measuring becomes very difficult as we need to know what to measure against. Let’s go back through our rules, keep in mind that these rules are not version agnostic so using 1.x, mono, or rotor you will likely get different results!
1) If you wish to use a method call for your stop condition that does anything more complex than simply returning a variable or is not inlinable for other reasons such as it being virtual, you should hoist it manually in order by placing it in the first part of your for loop to make it explicit to the JIT that you do not wish the method to be called on every iteration.
2) When dealing with items that are enumerable as opposed to dealing with arrays or collections; prefer foreach to a for loop as the enumerator will often offer a much faster way of enumerating than using an index.
3) If you are dealing with array access in a highly performant area and the bounds checks being in the loop are too much overhead. You will have to handle your iteration in unsafe code to remove the bounds checks (in other words, it is impossible to write performant looping code in VB.NET)
4) Try to keep creation and initialization of arrays within the same method as the JIT can often times remove the bounds checks during your initialization.
Hopefully by following these rules you can save yourself some time the next time you are optimizing code as many of these rules deal with things that are difficult to measure, but remember measurement is key.
I hope you enjoyed reading about these optimizations as much as I enjoyed writing about them.
Reference :- http://codebetter.com/blogs/gregyoung/archive/2006/06/11/146343.aspx