I have a program that
- allocates two blocks of memory of size n, using HeapAlloc;
- copies the contents of block 1 to block 2 with the following code:
mov ecx, n
shr ecx, 2            ; ecx = number of dwords to copy
xor ebx, ebx          ; ebx = byte offset into both blocks
copyloop:
mov eax, [esi + ebx]  ; esi and edi are both a multiple of 256
mov [edi + ebx], eax
add ebx, 4
dec ecx
jnz copyloop
When I run this code on a Win XP, 1.9 GHz P4, it uses many more CPU cycles than on a Win 2K, 1.7 GHz P4.
That is, if n is about 400,000 or larger. The larger n is, the greater the difference (e.g., when n = 10,000,000 the difference is about a factor of 1.5).
If n < 400,000, the number of cycles is about the same on both machines, which is what I'd expect for any n.
Both CPUs are exactly the same with regard to 1st and 2nd level cache sizes and cache line sizes; I verified this with the CPUID instruction.
I can hardly imagine that it is an OS issue, but it sure seems like it.
Anyone have an idea on this?