Memory bandwidth deficiency?

Processor: AMD AthlonXP Thorton Core 2.0GHz
Chipset: VIA KT600
Memory: DDR1 1.5GB
FSB: 133/266 MHz

SiSoftware Sandra 2002 benchmark result: 1.7 GB/s


DWORD * mem1=(DWORD*)malloc(1920000);//800*600*4 screen resolution
DWORD * mem2=(DWORD*)malloc(1920000);

void Copy1(){
int i;
for(i=0;i<480000;i++){
mem1[i]=mem2[i];
}
}

One thousand calls to Copy1 take 6.65 seconds, which works out to
about 563 MB/s when both directions are counted.
Copy1 moves every byte twice: from memory to the processor
and then from the processor back to memory.
(I timed it with the GetTickCount() function.)
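
For reference, the arithmetic behind a figure like this can be sketched in a few lines of C. The helper names `bytes_per_second` and `print_copy1_throughput` are made up for illustration. The inputs are the numbers from the post: a 1,920,000-byte buffer, 1000 calls, 6.65 seconds. Counting only the read stream gives roughly 289 MB/s. Counting read plus write gives about 577 MB/s with 1 MB = 10^6 bytes, or about 551 MiB/s with 2^20 bytes, so the exact figure depends on the unit convention.

```c
#include <stdio.h>

/* Hypothetical helper: bytes moved per second for a copy benchmark.
   `directions` is 1 to count only the read stream, or 2 to count the
   read stream and the write stream together. */
double bytes_per_second(double buffer_bytes, double calls,
                        double seconds, int directions)
{
    return buffer_bytes * calls * directions / seconds;
}

/* Prints the Copy1 measurement from the post in both unit conventions. */
void print_copy1_throughput(void)
{
    double bps = bytes_per_second(1920000.0, 1000.0, 6.65, 2);

    printf("%.1f MB/s  (1 MB  = 10^6 bytes)\n", bps / 1e6);
    printf("%.1f MiB/s (1 MiB = 2^20 bytes)\n", bps / 1048576.0);
}
```

Keeping the unit and the number of directions explicit makes it easier to compare a hand-timed loop against a benchmark tool, which may use a different convention.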

void Copy2(){
_asm{
pusha
mov ecx,480000
mov esi,[mem2] // source, same direction as Copy1
mov edi,[mem1] // destination
}

l:
_asm{
mov eax,[esi]
add esi,4
mov [edi],eax
add edi,4
dec ecx
jnz l

popa
}
}

One thousand calls to Copy2 take 5.9 seconds, which gives 621 MB/s,
which is slightly better.

My question is: where is the rest, from 600 MB/s up to 1.7 GB/s?

If there is a technique in assembler that uses the cache to
achieve the best speed, please point it out to me.

Thanks.

Comments

  • Try the standard memcpy(); a decent compiler will have it well optimized, though it probably won't gain much speed.
    For more speed, try using [link=http://en.wikipedia.org/wiki/MMX_(instruction_set)]MMX[/link] registers instead of GPRs; they have the benefit of being wider (64 vs. 32 bits). MMX typically isn't used in general-purpose copy functions because it would break compatibility with older machines. Newer computationally intensive applications such as benchmarks try to detect (using [link=http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25481.pdf]CPUID[/link]) whether they can use MMX/[link=http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions]SSE[/link]/[link=http://en.wikipedia.org/wiki/3DNow!]3DNow![/link] and then use the best available option.
    Prefetching will also speed things up; read about it here:
    http://cdrom.amd.com/devconn/events/gdc_2002_amd.pdf
  • I'll rephrase my question.

    How do I get 16 bytes (64-bit DDR1 bus width * 2 transfers per clock) copied into cache (ideally L1) at once?

  • That's a very different question from the original. I'm no expert, but afaik the Athlon has 64-byte cache lines, so if the 16 bytes of data do not straddle a 64-byte boundary, you should have no problem getting them into cache at once. Align the data to 16 bytes if you want to match the DDR transfer size as well, but it doesn't really matter, because the entire cache line has to be fetched anyway.
  • Also try

    for(i=480000; i!=0; i--)


    If your compiler is daft, it won't make that optimization.
  • The index range is from 479999 down to 0 (both inclusive), so I think
    for(i=480000; i!=0; i--)
    is not correct.
    I also tried it, and it's much slower (?).
    I am using Microsoft Visual Studio 6.

    void Copy3(){
    int i;
    for(i=479999; i>=0; i--){
    mem1[i]=mem2[i];
    }
    }

    Also, I now realise that the Sandra 2002 result is for MMX transfers.
    Sandra 2001 gives a result of 670 MB/s.
    But still, where is the rest?
    The DDR1 memory bandwidth is:
    133 MHz FSB * 2 (double data rate) * 64-bit bus width (8 bytes) = 2128 MB/s.
  • I did some tests, and my conclusion is that the difference is too small to measure reliably on a sluggish PC. Because of context switches, whichever algorithm has the best luck wins.

    With enough iterations you can perhaps compensate for that. I managed to get a slightly better result with Copy3() than with Copy1(). I couldn't get the inline asm to work in gcc; I suppose it doesn't like Intel syntax or something...

    Here is my test code:

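    [code]#include <windows.h> /* DWORD, GetTickCount */
    #include <stdio.h>   /* printf, getchar */
    #include <stdlib.h>  /* malloc */
    #include <string.h>  /* memset */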

    static DWORD * mem1=(DWORD*)malloc(1920000);//800*600*4 screen resolution
    static DWORD * mem2=(DWORD*)malloc(1920000);

    void Copy1()
    {
    int i;
    for(i=0;i<480000;i++)
    {
    mem1[i] = mem2[i];
    }
    }

    void Copy3()
    {
    int i;
    for(i=479999; i>=0; i--)
    {
    mem1[i] = mem2[i];
    }
    }

    int main (void)
    {
    DWORD tick;
    DWORD i;
    DWORD total_ticks;
    const int iterations = 10000;

    total_ticks = 0;
    for(i=0; i<iterations; i++)
    {
    memset(mem1, 0, 1920000);
    memset(mem2, 0, 1920000);
    tick = GetTickCount();
    Copy1();
    tick = GetTickCount() - tick;
    total_ticks += tick;
    }
    printf("Copy1: %lu ms\n", total_ticks);

    total_ticks = 0;
    for(i=0; i<iterations; i++)
    {
    memset(mem1, 0, 1920000);
    memset(mem2, 0, 1920000);
    tick = GetTickCount();
    Copy3();
    tick = GetTickCount() - tick;
    total_ticks += tick;
    }
    printf("Copy3: %lu ms\n", total_ticks);

    total_ticks = 0;
    for(i=0; i<iterations; i++)
    {
    memset(mem1, 0, 1920000);
    memset(mem2, 0, 1920000);
    tick = GetTickCount();
    Copy3();
    tick = GetTickCount() - tick;
    total_ticks += tick;
    }
    printf("Copy3 again: %lu ms\n", total_ticks);

    total_ticks = 0;
    for(i=0; i<iterations; i++)
    {
    memset(mem1, 0, 1920000);
    memset(mem2, 0, 1920000);
    tick = GetTickCount();
    Copy1();
    tick = GetTickCount() - tick;
    total_ticks += tick;
    }
    printf("Copy1 again: %lu ms\n", total_ticks);

    getchar();
    return 0;
    }[/code]
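
The suggestion in the first comment (wider registers plus prefetching) can also be sketched in C with SSE intrinsics instead of hand-written assembler. This is only a sketch under stated assumptions, not a tuned routine. The name `copy_stream` is made up here. It assumes a compiler that ships `xmmintrin.h`, buffers aligned to 16 bytes, and a size that is a multiple of 64 bytes. When those conditions are not met it simply falls back to memcpy(). It uses SSE1 instructions only. The stores are non-temporal, which may help on large buffers because the destination lines do not have to be read into cache first. Whether it gets close to the Sandra figure on a given machine has to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h> /* SSE1: _mm_load_ps, _mm_stream_ps, _mm_prefetch */
#define COPY_HAVE_SSE 1
#endif

/* Hypothetical helper (not from the thread): copies `bytes` from src to dst
   with 16-byte loads and non-temporal stores, one 64-byte line per pass. */
void copy_stream(void *dst, const void *src, size_t bytes)
{
#ifdef COPY_HAVE_SSE
    int aligned = (((uintptr_t)dst | (uintptr_t)src) % 16) == 0;

    if (aligned && bytes % 64 == 0) {
        const float *s = (const float *)src;
        float *d = (float *)dst;
        size_t n = bytes / sizeof(float);
        size_t i;

        for (i = 0; i < n; i += 16) { /* 16 floats = 64 bytes */
            __m128 a, b, c, e;

            /* Hint the line 256 bytes ahead, staying inside the buffer. */
            if (i + 64 < n)
                _mm_prefetch((const char *)(s + i + 64), _MM_HINT_NTA);

            a = _mm_load_ps(s + i);
            b = _mm_load_ps(s + i + 4);
            c = _mm_load_ps(s + i + 8);
            e = _mm_load_ps(s + i + 12);

            /* Non-temporal stores: written out without filling the cache. */
            _mm_stream_ps(d + i, a);
            _mm_stream_ps(d + i + 4, b);
            _mm_stream_ps(d + i + 8, c);
            _mm_stream_ps(d + i + 12, e);
        }
        _mm_sfence(); /* order the streaming stores before returning */
        return;
    }
#endif
    memcpy(dst, src, bytes); /* portable fallback */
}
```

In the test program above it would be called as copy_stream(mem1, mem2, 1920000). Note that 1920000 is a multiple of 64, but plain malloc() does not promise 16-byte alignment on every platform, so an aligned allocator may be needed to reach the fast path.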