Loop Unrolling Optimization Question - Programmers Heaven

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Categories

Welcome to the new platform of Programmer's Heaven! We apologize for the inconvenience caused, if you visited us from a broken link of the previous version. The main reason to move to a new platform is to provide more effective and collaborative experience to you all. Please feel free to experience the new platform and use its exciting features. Contact us for any issue that you need to get clarified. We are more than happy to help you.

Loop Unrolling Optimization Question

I have two versions of an iterative procedure, one that has rolled-up loops and one that is unrolled. The computation involves three nested loops. By my calculations, the rolled-up version requires 90,460 cycles, whereas the unrolled loop version has a calculated latency of 56,400 cycles.

However, in real execution, the rolled version is running 2-2.5 times faster. My question is why is the rolled version, with the higher latency, executing faster? Am I getting 'beaten' by instruction caching?

For example, the rolled version has much short code, and each time through the loops, the instructions are the same....only register contents change. For the unrolled version, the actual memory references are changing, so the much longer code cannot be cached.

By the way, this is running on a 1200 MHz Athlon (Thunderbird), with ABIT KT7E mainboard.

Any comments regarding this optimization issue are welcome....if it is a caching issue, how much 'unrolling' can be done before caching beats the unrolling?

Thanks,

John

Comments

  • markjolesenmarkjolesen Posts: 4Member
    I am guessing you are running under windows, if so, it can screw everything up. Especially when and how you are accessing memory. Take a look at the number of page faults, etc... I guess what I am saying is, you need to really analyze what your code is doing and what you want to accomplish. If there is branches, then branch prediction prefetch can take the wrong path and can cause a stall...who knows? Show the code...

    : I have two versions of an iterative procedure, one that has rolled-up loops and one that is unrolled. The computation involves three nested loops. By my calculations, the rolled-up version requires 90,460 cycles, whereas the unrolled loop version has a calculated latency of 56,400 cycles.
    :
    : However, in real execution, the rolled version is running 2-2.5 times faster. My question is why is the rolled version, with the higher latency, executing faster? Am I getting 'beaten' by instruction caching?
    :
    : For example, the rolled version has much short code, and each time through the loops, the instructions are the same....only register contents change. For the unrolled version, the actual memory references are changing, so the much longer code cannot be cached.
    :
    : By the way, this is running on a 1200 MHz Athlon (Thunderbird), with ABIT KT7E mainboard.
    :
    : Any comments regarding this optimization issue are welcome....if it is a caching issue, how much 'unrolling' can be done before caching beats the unrolling?
    :
    : Thanks,
    :
    : John
    :

  • MedoMedo Posts: 18Member
    I'm quite a beginner at optimization, but maybe this could work:

    If you do a read access at the beginning of each cacheline that loads the content of a part of the next cacheline into an unused register, the next line will probably be loaded to the L2 cache without losing much speed, because the memory access has no dependencies.

    I don't know if this will bring much speed, or any at all. But if it works, please write back...
  • dsbsciencedsbscience Posts: 4Member
    : If you do a read access at the beginning of each cacheline that loads the content of a part of the next cacheline into an unused register, the next line will probably be loaded to the L2 cache without losing much speed, because the memory access has no dependencies.
    :


    Actually, I was hoping to do both runs without any caching....just to be sure I am comparing apples and apples. One of the routines (it is a 20x20 matrix multiplication) is very amenable to caching benefits and the other is not....

    Thanks!

  • dsbsciencedsbscience Posts: 4Member
    :Show the code...


    It is just a 20x20 matrix multiplication ... the 4x4 unrolled code flies compared to the rolled version, the larger one does not...that's why I was thinking caching was the issue.

    Thanks!

    John
  • anqeanqe Posts: 1Member
    [b][red]This message was edited by anqe at 2002-12-4 4:27:34[/red][/b][hr]
    If your compiler/assembler is generating poor machine code then unrolling may not be of any benefit.

    visit AMD's website and download their code analyst software and the "AMD Athlon x86 Code Optimization Guide", some machine code instructions may need to be enterred manually to overrride the assembler generated instructiuons.

    Like the manual says, avoid vector path instr. and arrange the code out of order to hide any memory latency as much as possible.

    Ensure that any constant moves (that can be) are replaced with a clear and add imediate structure.

    Don't always optimize based on execution latency, take into account the decoder and register dependancies. Also note that optimizing for your hardware is far more productive than optimizing for other peoples hardware :)

    If you have a K7 don't worry about the Pentium 4, most K7 optmizations have no effect or a negative effect on the P4 (especially shift and (i)mul).





Sign In or Register to comment.