Help me! It's slow...

This is my source code for a 32-bit BitBlt, but it's too
slow. Here it is:

Blt32 proc uses ebx esi edi lpDst:DWORD,lpSrc:DWORD,PixelsperLine:DWORD,Lines:DWORD
mov edx,PixelsperLine
shl edx,2 ;*4 = bytes per line
mov esi,lpSrc
mov edi,lpDst
mov ebx,Lines
xor ecx,ecx
StartX:
mov eax,[esi+ecx]
mov [edi+ecx],eax ;if I delete '+ecx', it's much faster.
add ecx,4
cmp ecx,edx
jl StartX

xor ecx,ecx
add esi,edx ;Move to Next Src Line
add edi,edx ;Move to Next Dst Line
dec ebx
jnz StartX
ret
Blt32 endp

What makes it so slow? If I delete '+ecx', it's much faster.
Is it a cache miss or an AGI (address generation interlock) stall? How can I make it faster? Thanks!

Comments

  • What system are you coding for?
    Usually, bitmap transfer is made by a REP MOVSD instruction.
  • Yes, I tried REP MOVSD and STOSD, but the speed is the same as with mov [edi+ecx],eax (compiled under Win98 with MASM 6.0).
  • What system? DOS, WINDOWS, *NIX?
  • ECX is conventionally the loop counter, and
    EBX is the index register used in [brackets];
    you have them reversed.

    You step the index with ADD ECX,4; INC ECX four times might be faster on some CPUs, but measure it.

    mov [edi+ecx],eax ;an indexed address can cost additional clocks
    STOSD ;stores EAX at [ES:EDI], then adds 4 to EDI (direction flag clear); fast code
    As AsmGuru62 stated, a REP string instruction is fast
    code on a 32-bit computer, but you're MOVing data, so:

    Do your math first and put the number of dwords to move in ECX,
    then move them from [DS:ESI] to [ES:EDI]
    (SI and DI are the low 16 bits of ESI and EDI).
    REP MOVSD ;moves it all at once; this is the ultimate speedster

    The ugly fact is: once it's done right, the code is smaller than the
    call and return overhead of the proc, so you're better off with a macro,
    or just write the REP MOVSD code inline instead of having a called proc.
    Also, a proc call can waste time if the proc isn't aligned on a 16-byte boundary.
    The CPU fetches code in aligned blocks, so an entry point in the middle
    of a block can cost extra fetch time on each call
    to an unaligned proc.
    If you debug or disassemble it,
    find the proc address;
    if the last hex digit is 0, it's aligned properly.
    align 16 ;NASM: aligns the next label on a paragraph (16-byte) boundary
    PROC: ;here
    Yours has no ALIGN in front of it.

    Also, keeping values in registers is the fastest,
    but your proc takes its arguments from memory (the stack), and memory is slower.
    Many procs take their input in registers, so the registers are already
    loaded and ready for the fast work.

    cmp ecx,edx
    jl StartX
    is slow code; the CMP takes some time.
    DEC reg32 ;put the loop count in a 32-, 16- or 8-bit register
    JNZ LOOP_TOP ;(loop count in ECX and a LOOP_TOP label are standard)
    DEC is fast, tight code too, and so is JNZ when it's a short jump.
    It's tight, hot, fast, & easy.
    (Sounds like a good date, huh.)

    I can't concentrate any more, so
    I hope that helped somehow.
    Bitdog
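
For readers more at home in C, the "do your math first, then move it all at once" advice above can be sketched roughly as follows. This is only an illustration, not the poster's code: the function name blt32 and the separate dst_pitch / src_pitch parameters are assumptions (the original Blt32 implicitly uses a pitch of PixelsperLine * 4), and memcpy stands in for REP MOVSD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a rectangle of 32-bit pixels.
 * dst_pitch / src_pitch are the byte distances between the starts of
 * consecutive lines. They are hypothetical parameters: the original
 * Blt32 assumes both equal pixels_per_line * 4. */
void blt32(uint32_t *dst, size_t dst_pitch,
           const uint32_t *src, size_t src_pitch,
           size_t pixels_per_line, size_t lines)
{
    size_t row_bytes = pixels_per_line * sizeof(uint32_t);

    /* A line must fit inside its pitch. */
    assert(dst_pitch >= row_bytes && src_pitch >= row_bytes);

    if (dst_pitch == row_bytes && src_pitch == row_bytes) {
        /* Lines are back to back: one bulk copy, the analogue of
         * loading ECX with the total dword count and doing REP MOVSD. */
        memcpy(dst, src, row_bytes * lines);
        return;
    }

    /* Otherwise do one bulk copy per line and step by the pitch. */
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    for (size_t y = 0; y < lines; y++) {
        memcpy(d, s, row_bytes);
        d += dst_pitch;
        s += src_pitch;
    }
}
```

When both pitches equal the line width in bytes, the whole bitmap goes out in a single copy; otherwise the per-pixel inner loop is still replaced by one bulk copy per line.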

  • Thanks for your help. My BitBlt is faster than before, but only a little. :)
    I ran a test; the results are:

    ;///////////////////////////////////////////////////////////////
    ; Tst.asm
    ;///////////////////////////////////////////////////////////////
    .486
    .model flat,stdcall
    option casemap:none
    include \masm32\include\windows.inc
    .code

    DllEntry proc hInstDLL:HINSTANCE, reason:DWORD, reserved1:DWORD
    mov eax,TRUE
    ret
    DllEntry Endp

    ;---------------------------------------
    ; Function: BitBlt
    ;---------------------------------------
    Tst proc pDstBegin:DWORD,pSrcBegin:DWORD,nCount:DWORD
    push ebx ;ebx is used as the index below and must be preserved under stdcall
    push esi
    push edi

    xor ebx,ebx
    mov esi,pSrcBegin
    mov edi,pDstBegin
    mov ecx,nCount

    ;Uncomment one variant at a time. Times measured with the timeGetTime Windows API.

    ;StartX: ;cost 19ms
    ; mov eax,[esi+ebx]
    ; mov [edi+ebx],eax
    ; add ebx,4
    ; dec ecx
    ; jnz StartX

    ;StartX: ;cost 20.5ms
    ; mov eax,[esi]
    ; mov [edi],eax
    ; add esi,4
    ; add edi,4
    ; dec ecx
    ; jnz StartX

    ;StartX: ;cost 20.5ms
    ; mov eax,[esi+ebx]
    ; stosd
    ; add ebx,4
    ; dec ecx
    ; jnz StartX

    ;StartX: ;cost 14.5ms
    ; rep movsd

    GetOut:
    pop edi
    pop esi
    pop ebx
    ret
    Tst endp
    End DllEntry

    ;///////////////////////////////////////////////////////////////
    ; Tst.def
    ;///////////////////////////////////////////////////////////////
    LIBRARY Tst
    EXPORTS Tst


    If I ALIGN my proc, the time it costs is almost the same (about 19-20 ms).
    REP is the fastest way, but it only works for BitBlt or CopyMemory (a contiguous data block). What about StretchBlt?
    I think it may be an optimization problem: cache misses, imperfect pairing, or something else.
  • First, your timing checks are really commendable, congrats.

    OK, REP MOVSD is the fastest and simplest, BUT
    what if the addresses you are moving to and from aren't aligned at 4,
    i.e. the last hex digit isn't 0, 4, 8 or C (12)?
    Aligning them may help.
    If a little math to align the data is done before a loop (REP is a loop),
    the loop time can be cut down.
    Bytes don't have to be aligned, but words on up could/should be aligned
    for speed, since the CPU has to do extra work when fetching unaligned data (at 2 for words, at 4 for dwords, etc.).
    If you have a loop that moves a lot of data,
    you should align the data when you write the program, or,
    for an unknown data address, align it at runtime.
    If you want to align something at 16, you add 15 and then clear the bits that can hold the value 15. This works for every power-of-two alignment. So let's do a simple one: ALIGN at 4.
    ADD AL,3
    AND AL,11111100b ;discard bits 0 & 1
    The decimal value 3 is written in binary as 11b,
    so adding 3 carries AL up into bit 2 and above unless AL is already aligned. (The 8 bits are numbered 0-7.)
    Then clearing the low two bits aligns it at 4.
    (We are rounding up to the next align-4 address.)

    If I wanted to know how far AL is past the previous align-4 address:
    MOV AH,AL
    AND AH,00000011b ;keep bits 0-1
    So AH now holds the misalignment.

    Aligning your proc may help a little or a lot.
    I mean, if the proc already happens to be aligned and you align it,
    no clocks are saved.
    But you never know whether it's aligned unless you debug or disassemble your
    program and check each proc label, and it's easier to put ALIGN 16
    in front of every NASM proc.
    Also, if you add any code in front of data, the alignment changes.
    So if you test for time and then alter the code, you need to retest; adding ALIGN 16 and forgetting about it is so much easier. I don't know what MASM uses: EVEN for align 2, and probably a million made-up names for you to remember.
    If a proc's alignment is way off and you align it,
    it only speeds up the call, and that isn't a biggie,
    unless it's called from a loop. That means it could be called thousands of times, and that's a time waster.

    Also, with code as small as REP MOVSD you don't need a proc.
    If you CALL a proc, PUSH some regs, do a REP MOVSD, POP the regs, and RETurn,
    you've created bloatware that MS would be proud of.

    Anyway, there are a few more ideas to play with.
    Feel free to post again, for sure.

    Bitdog
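
The add-then-mask alignment arithmetic described in the post above generalizes to any power-of-two alignment. A minimal C sketch, with hypothetical helper names (align_up, misalignment, bytes_to_align) and assuming the value is nowhere near the top of the address range, so the addition cannot overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of a (a must be a power of two).
 * General form of ADD AL,3 / AND AL,11111100b for a = 4. */
uintptr_t align_up(uintptr_t x, uintptr_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + (a - 1)) & ~(a - 1);
}

/* How far x is past the previous multiple of a.
 * General form of AND AH,00000011b for a = 4. */
uintptr_t misalignment(uintptr_t x, uintptr_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return x & (a - 1);
}

/* Bytes to step forward from x to reach an aligned address
 * (0 if x is already aligned). */
uintptr_t bytes_to_align(uintptr_t x, uintptr_t a)
{
    return align_up(x, a) - x;
}
```

For example, align_up(5, 4) gives 8 and misalignment(7, 4) gives 3. A copy routine could move bytes_to_align(dst, 4) single bytes first and then switch to dword moves for the rest.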
