XP/p4 performance issue

tsagld Member Posts: 621
Hi,

I have a program that
- allocates two blocks of memory of size n, using HeapAlloc;
- moves the content of block 1 to block 2 with the following code:
[code]
mov ecx, n
shr ecx, 2
xor ebx, ebx
n0:
mov eax, [esi + ebx] ; esi and edi are both a multiple of 256
mov [edi + ebx], eax
add ebx, 4
dec ecx
jnz n0
[/code]

Problem:
When I run this code on a Win XP 1.9 GHz P4, it uses many more CPU cycles than when I run it on a Win 2K 1.7 GHz P4.
This happens only if n is about 400,000 or larger. The larger n, the greater the difference (e.g., when n = 10,000,000 the difference is about a factor of 1.5).
If n < 400,000, the number of cycles is about the same on both machines, which is what I'd expect for any n.

Both CPUs are exactly the same with regard to 1st- and 2nd-level cache sizes and cache line sizes. I checked this with the CPUID instruction.
I can hardly imagine that it is an OS-issue, but it sure seems like it.

Anyone have an idea on this?
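
For anyone who wants to reproduce the measurement, here is a minimal C sketch of the same experiment. It is not the original test program: `copy_dwords` and `ns_per_byte` are hypothetical names, `malloc` stands in for HeapAlloc, and `clock()` measures CPU time rather than raw cycle counts, so treat the numbers as relative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Same access pattern as the asm loop: one 32-bit load and one 32-bit
   store per iteration, n_bytes / 4 iterations. Any remainder is ignored,
   just as with "shr ecx, 2". */
void copy_dwords(uint32_t *dst, const uint32_t *src, size_t n_bytes)
{
    size_t count = n_bytes / 4;
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}

/* Average nanoseconds of CPU time per byte copied for a block of n_bytes,
   or -1.0 on bad arguments or allocation failure. */
double ns_per_byte(size_t n_bytes, int repeats)
{
    if (n_bytes < 4 || repeats <= 0)
        return -1.0;

    uint32_t *src = malloc(n_bytes);
    uint32_t *dst = malloc(n_bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    for (size_t i = 0; i < n_bytes / 4; i++)
        src[i] = (uint32_t)i;

    /* Reading the result through a volatile keeps the compiler from
       discarding the repeated copies. */
    volatile uint32_t sink = 0;
    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        copy_dwords(dst, src, n_bytes);
        sink += dst[(n_bytes / 4) - 1];
    }
    clock_t stop = clock();
    (void)sink;

    free(src);
    free(dst);

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    return seconds * 1e9 / ((double)repeats * (double)n_bytes);
}
```

Calling `ns_per_byte` for, say, n = 100,000, 400,000 and 10,000,000 on both machines and comparing the per-byte cost should show whether the gap really opens up around n = 400,000.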


Greets,
Eric Goldstein
www.gvh-maatwerk.nl

Comments

  • AsmGuru62 Member Posts: 6,519
    [blue]I wondered about that one myself: why do people use that thing with two moves,

    [b]MOV EAX, [ADDRESS1]
    MOV [ADDRESS2], EAX[/b]

    when the perfectly good MOVSD instruction exists? It also advances ESI and EDI for you!

    Well, I talked to some gurus. It looks like on very large blocks, like yours, the two-move loop is significantly slower. No cause was found, but the code was measured and MOVSD is faster. Try it the usual way:

    [b]mov esi, address1
    mov edi, address2
    cld
    mov ecx, n
    shr ecx, 2
    rep movsd[/b][/blue]
  • tsagld Member Posts: 621
    [b][red]This message was edited by tsagld at 2002-11-4 9:42:51[/red][/b][hr]
    Of course MOVSD is faster, but that is not the issue here.
    It can be even faster using the P4's MOVDQA or MOVDQU instructions, which move 16 bytes per instruction. In fact, I am using that in my program.
    The thing is that it doesn't seem to matter whether I move one, two, four, 8 (using MMX instructions) or 16 (using the P4's XMM instructions) bytes per instruction.
    The difference in performance remains. It seems to be some kind of memory or processor issue.

    Nice for you to know, perhaps:
    To move 16 bytes at a time:
    [code]
    mov ecx, n
    shr ecx, 4
    xor ebx, ebx
    n0:
    movdqu xmm0, [esi + ebx] ; esi and edi are both a multiple of 256
    movdqu [edi + ebx], xmm0
    add ebx, 16
    dec ecx
    jnz n0
    [/code]

    With MSVC 6.0 you will need the latest Processor Pack from Microsoft, which is a free download.
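
For compilers that ship `<emmintrin.h>`, the 16-byte loop above can also be sketched in C with SSE2 intrinsics instead of inline assembly. This is an illustration rather than the code used in the program: `copy_sse2` is a hypothetical name, and the byte-wise tail loop is an addition (the asm version simply drops any remainder after `shr ecx, 4`).

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>

/* Copies n_bytes from src to dst, 16 bytes per iteration using unaligned
   SSE2 loads and stores (the intrinsic counterparts of MOVDQU). */
void copy_sse2(void *dst, const void *src, size_t n_bytes)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i = 0;

    for (; i + 16 <= n_bytes; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }

    /* Tail: copy the remaining 0..15 bytes one at a time. */
    for (; i < n_bytes; i++)
        d[i] = s[i];
}
```

Since both blocks are aligned to a multiple of 256 here, the aligned variants `_mm_load_si128` and `_mm_store_si128` (MOVDQA) could be substituted in the main loop.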


    Greets,
    Eric Goldstein
    www.gvh-maatwerk.nl



  • eikedehling Member Posts: 123
    When you use the normal, XMM, MMX or FPU registers in your code, the OS has to save them at a task switch.

    -> For the normal registers this can be done automatically by the CPU (not in Windows, because it's faster to save them by hand, but that's a side note).

    -> For the XMM, MMX and FPU registers the x86 has a special mechanism to detect whether they have been saved since the last task switch: using them before they were saved raises an exception (the device-not-available exception, much like a page fault), and only then does the OS have to save and restore them.

    So Windows does not have to save any 'special' registers until another program that also uses them gets activated. If on W2K there was no other program using the special registers, W2K would never save them, because the mechanism makes that unnecessary. If on WinXP there was such a program, WinXP would have to save/restore your program's registers all the time. That takes very long, and would explain the behavior you encountered, especially if you consider that a "short" copy is more likely to complete before a task switch occurs, so only "long" copies would take abnormally long.

    HMMmmm, now I've got cramps in my fingers :) Hope this gives you a clue.

    Eike.

    SUSE LINUX 7.3 PRO - The world starts behind windows

  • tsagld Member Posts: 621
    Thanks Eike, but I don't think that explains my issue.
    The difference in performance is abnormal. The origin of my problem is a mathematical program that performs a number of iterations in 9 min 15 secs on my 1.7 GHz Win2K machine, but takes 15:30 on the 2.0 GHz XP machine.
    It doesn't make any difference if I run my program at the highest possible priority, which minimizes the number of task switches.
    I tracked it down to the read/write memory sample code I provided in my first post. Also, the difference remains if I do only reads or only writes.

    A program that performs the same mathematical task, but written by another coder and thus with different code, doesn't show these symptoms at all. It runs almost exactly 2.0/1.7 times faster on the 2.0 GHz machine than on the 1.7 GHz one. According to the coder, it uses the same amount of memory and does about the same number of memory reads and writes.
    Any more ideas?

    (for the origin of the problem, look at www.p196.org, and click My Blackboard).

    Greets,
    Eric Goldstein
    www.gvh-maatwerk.nl

  • mfroeb Member Posts: 53
    Do you have the same memory? Perhaps your memory is faster than his, so the processor needs fewer clock cycles to feed the data back to memory. Just an idea ...
  • tsagld Member Posts: 621
    Yes, we're positive that the memory is the same.


    Greets,
    Eric Goldstein
    www.gvh-maatwerk.nl

  • pthack Member Posts: 1
    Have you made absolutely sure that the CPUs are identical apart from speed? (That includes family, model, AND stepping.)

    Using one of the CPUID tools, get the CPU type and cache descriptors.

    I'll bet you are running into one of the newer CPUs that have a different cache architecture.

    For example, some of the newer P4 Celeron (128k cache) CPUs have a 2-way set-associative cache vs. the older ones with a 4-way set-associative cache. I've seen significantly different benchmark results even with identical-speed CPUs (with only the stepping, and thus the cache architecture, changing).

    You can look at the various flavors of the CPUID instruction with different parameters to get an idea: 0 gives you the GenuineIntel string, 1 gives you the family/model/stepping and feature register, and 2 gives you the cache 'tokens'.

    The cache tokens I was looking at had a 39h for the 128k 4-way SA cache vs. 3Bh for the 128k 2-way SA cache. The 2-way was 10+% slower on a public benchmark. Depending on the memory ops, it could give VERY different speed results that do not scale with a 2.0/1.7 ratio.
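
As a rough illustration of the check described above, the C sketch below decodes the family/model/stepping from the EAX value returned by CPUID with parameter 1, and scans the four registers returned with parameter 2 for a given cache token. The names `cpu_signature`, `decode_signature` and `has_cache_descriptor` are made up for this example, and the register values are assumed to have been read already (with inline asm or a compiler helper); executing CPUID itself is not shown.

```c
#include <stdint.h>

typedef struct {
    unsigned family;
    unsigned model;
    unsigned stepping;
} cpu_signature;

/* Decodes EAX from CPUID leaf 1. The extended family/model fields are
   folded in the way Intel documents: extended family is added only when
   the base family is 0Fh, extended model only for families 06h and 0Fh. */
cpu_signature decode_signature(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned base_model  = (eax >> 4) & 0xF;
    unsigned ext_family  = (eax >> 20) & 0xFF;
    unsigned ext_model   = (eax >> 16) & 0xF;

    cpu_signature s;
    s.stepping = eax & 0xF;
    s.family = (base_family == 0xF) ? base_family + ext_family : base_family;
    s.model = (base_family == 0x6 || base_family == 0xF)
                  ? (ext_model << 4) + base_model
                  : base_model;
    return s;
}

/* Scans the CPUID leaf 2 result (regs = EAX, EBX, ECX, EDX) for one
   descriptor byte, e.g. 0x39 or 0x3B. Returns 1 if found, 0 otherwise. */
int has_cache_descriptor(const uint32_t regs[4], uint8_t wanted)
{
    for (int r = 0; r < 4; r++) {
        if (regs[r] & 0x80000000u)
            continue;               /* bit 31 set: register holds no descriptors */
        for (int b = 0; b < 4; b++) {
            if (r == 0 && b == 0)
                continue;           /* AL is an iteration count, not a descriptor */
            if (((regs[r] >> (8 * b)) & 0xFF) == wanted)
                return 1;
        }
    }
    return 0;
}
```

Running both functions on the values from each machine and comparing the results side by side would confirm or rule out a stepping or cache-descriptor difference.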
  • tsagld Member Posts: 621
    Yes, I am sure the CPUs are identical with regard to stepping and model. I used the CPUID instruction for that. I also used CPUID to view the cache types and sizes. They too are exactly the same.

    The only difference is the 'FeatureSet' reported by Windows, in HKLM\Hardware\DESCRIPTION\System\CentralProcessor

    The slow (2.0 GHz XP) machine says: 0x00073fff
    The fast (1.7 GHz 2K) machine says: 0x00002fff

    I don't know what the featureset means, but the bits indicate that the 2.0 GHz machine has the same features as the 1.7 GHz machine, plus some more.
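
The "same and more" observation can be checked with a couple of bit operations. The sketch below (with a hypothetical helper name, `bits_only_in`) only compares the two values quoted above; it makes no assumption about what the individual bits mean.

```c
#include <stdint.h>

/* Returns the bits that are set in a but not in b. */
uint32_t bits_only_in(uint32_t a, uint32_t b)
{
    return a & ~b;
}
```

With the values above, `bits_only_in(0x00073fff, 0x00002fff)` is 0x00071000 (bits 12, 16, 17 and 18), while `bits_only_in(0x00002fff, 0x00073fff)` is 0, so every bit set on the 1.7 GHz machine is also set on the 2.0 GHz machine.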

    I also asked Intel. They have no idea and wanted me to swap all kinds of hardware. Won't do that...

    So, the problem still exists...and I really have no clue.


    Greets,
    Eric Goldstein
    www.gvh-maatwerk.nl
