Use MMX and SIMD in Delphi

Hi, how can we take advantage of MMX and SIMD in our Delphi programs to speed up our applications?

Comments

  • MMX can only be used for integer calculations, and the MMn registers are mapped onto the FPU registers. For floating-point operations, MMX will only slow the calculation down: while MMX occupies those registers they cannot hold floating-point values, so the values have to be fetched from the cache or main memory.
  • ??
    Since MMX is integer-only, it can't do floating-point calculations at all.
    How can it slow them down, then?

    There are other alternatives, like SSE.
  • It uses the floating-point registers, so the processor cannot keep floating-point values in them. That means more floating-point values must be loaded from the cache or from RAM. Both have a higher latency than the registers, which means the processor cannot continue the calculation for several cycles while it waits for the data. And a waiting processor slows down floating-point calculations. Since hadipardis's previous post was about optimizing floating-point calculations, this point becomes very relevant.
  • OK.

    As I said, there are other alternatives:
    SSE, SSE2, 3DNow! and some others for different processors.
  • Thanks. Anyway, I am looking for a fast way to multiply two matrices of doubles. Do you have any ideas for me?
  • You could use Strassen's algorithm for really big matrices (about n>7, though in practice the break-even point tends to be far higher). If you change your matrix storage and write the code in assembly, you can also use pipelining to speed up the process. There are other algorithms that are fast for dense or sparse matrices, but I don't know which. All of these algorithms will greatly increase memory usage, and might not produce a significant speed increase under Windows (due to multi-threading).
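
    Before reaching for Strassen or assembly, a plain triple loop with a cache-friendly loop order is often a reasonable baseline to measure against. The following is only a minimal sketch in plain Object Pascal, not a tuned implementation: the names TMatrix, NewMatrix and Multiply are made up for this example, and the flat row-major storage is an assumption about what "changing your matrix storage" could look like.

    [code]
    program MatMulSketch;
    {$APPTYPE CONSOLE}
    {$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

    type
      // Hypothetical type for this sketch: one flat, row-major block of doubles.
      TMatrix = record
        Rows, Cols: Integer;
        Data: array of Double;  // element (r, c) is stored at Data[r * Cols + c]
      end;

    function NewMatrix(Rows, Cols: Integer): TMatrix;
    begin
      Result.Rows := Rows;
      Result.Cols := Cols;
      Result.Data := nil;                   // drop any old contents first
      SetLength(Result.Data, Rows * Cols);  // new elements start out as 0
    end;

    // Returns A * B. The i-k-j loop order makes the inner loop walk through
    // B and C along consecutive memory addresses.
    function Multiply(const A, B: TMatrix): TMatrix;
    var
      i, j, k: Integer;
      Aik: Double;
      C: TMatrix;
    begin
      Assert(A.Cols = B.Rows, 'inner dimensions must match');
      C := NewMatrix(A.Rows, B.Cols);
      for i := 0 to A.Rows - 1 do
        for k := 0 to A.Cols - 1 do
        begin
          Aik := A.Data[i * A.Cols + k];
          for j := 0 to B.Cols - 1 do
            C.Data[i * C.Cols + j] :=
              C.Data[i * C.Cols + j] + Aik * B.Data[k * B.Cols + j];
        end;
      Result := C;
    end;

    var
      A, B, C: TMatrix;
    begin
      A := NewMatrix(2, 2);
      B := NewMatrix(2, 2);
      A.Data[0] := 1; A.Data[1] := 2; A.Data[2] := 3; A.Data[3] := 4;
      B.Data[0] := 5; B.Data[1] := 6; B.Data[2] := 7; B.Data[3] := 8;
      C := Multiply(A, B);
      WriteLn(C.Data[0]:0:0, ' ', C.Data[1]:0:0);  // 19 22
      WriteLn(C.Data[2]:0:0, ' ', C.Data[3]:0:0);  // 43 50
    end.
    [/code]

    Whether this is fast enough depends on the matrix sizes and the compiler, so it is worth timing it on real data before trying anything more elaborate.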
  • But what is pipelining? (Please describe it briefly.) Could you give me an example of it? For example, suppose that we want to multiply two double values using pipelining. Thank you very much!
  • Processors can perform a limited set of operations simultaneously. Most of them can fetch a small number of bytes, perform an operation and store a small number of bytes within one clock cycle.
    This can be used to speed up processing of large numbers of values. In that case the processor fetches one or two values each clock cycle. Once the first value (or set of values) reaches the processor, it performs a single calculation on it, and in the next cycle the result is written back to memory. Drawn as a time graph, it looks like this:
    [code]
    Cycle  Get   Calc  Set
    1      A, B
    2      C, D
    3      E, F  A, B
    4      G, H  C, D  A, B
    5      I, J  E, F  C, D
    6      K, L  G, H  E, F
    7      M, N  I, J  G, H
    8      O, P  K, L  I, J
    9      Q, R  M, N  K, L
    10     S, T  O, P  M, N
    11     U, V  Q, R  O, P
    12     W, X  S, T  Q, R
    13     Y, Z  U, V  S, T
    14           W, X  U, V
    15           Y, Z  W, X
    16                 Y, Z
    [/code]
    where each letter combination represents the set of values the processor is working on.
    For non-pipelined code, the time graph for the same algorithm would look like this:
    [code]
    Cycle  Get   Calc  Set
    1      A, B
    2
    3            A, B
    4                  A, B
    5      C, D
    6
    7            C, D
    8                  C, D
    9      E, F
    10
    11           E, F
    12                 E, F
    13     G, H
    14
    15           G, H
    16                 G, H
    [/code]
    Pipelining is a very commonly used method of speeding up calculations in high-performance computing. The time gained depends on the latency between the memory and the processor.
    More info can be found here: http://en.wikipedia.org/wiki/Instruction_pipeline
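
    In a high-level language such as Delphi you cannot schedule the pipeline stages yourself, but you can give the processor independent work so that one operation does not have to wait for the result of the previous one. One general technique (not something proposed in this thread) is to split a long sum into several independent partial sums. The sketch below does this for a dot product of two arrays of doubles. DotSimple and DotUnrolled are made-up names, both functions assume that X and Y have the same length, and whether the second version is actually faster depends on the compiler and the CPU, so it has to be measured.

    [code]
    program DotProductSketch;
    {$APPTYPE CONSOLE}
    {$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

    type
      TDoubleArray = array of Double;

    // Straightforward version: every addition depends on the previous one.
    function DotSimple(const X, Y: TDoubleArray): Double;
    var
      i: Integer;
    begin
      Result := 0;
      for i := 0 to Length(X) - 1 do
        Result := Result + X[i] * Y[i];
    end;

    // Four independent partial sums: the four multiply-adds in the loop body
    // do not depend on each other, so the processor is free to overlap them.
    function DotUnrolled(const X, Y: TDoubleArray): Double;
    var
      i, N: Integer;
      S0, S1, S2, S3: Double;
    begin
      N := Length(X);
      S0 := 0; S1 := 0; S2 := 0; S3 := 0;
      i := 0;
      while i + 3 < N do
      begin
        S0 := S0 + X[i] * Y[i];
        S1 := S1 + X[i + 1] * Y[i + 1];
        S2 := S2 + X[i + 2] * Y[i + 2];
        S3 := S3 + X[i + 3] * Y[i + 3];
        Inc(i, 4);
      end;
      // Handle the remaining 0..3 elements.
      while i < N do
      begin
        S0 := S0 + X[i] * Y[i];
        Inc(i);
      end;
      Result := (S0 + S1) + (S2 + S3);
    end;

    var
      X, Y: TDoubleArray;
      i: Integer;
    begin
      SetLength(X, 1000);
      SetLength(Y, 1000);
      for i := 0 to 999 do
      begin
        X[i] := i;
        Y[i] := 2;
      end;
      WriteLn(DotSimple(X, Y):0:0);    // 999000
      WriteLn(DotUnrolled(X, Y):0:0);  // 999000
    end.
    [/code]

    Because floating-point addition is not associative, the two versions can differ in the last digits for general input; here the values are small integers, so both sums are exact.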
  • Many thanks for your good explanation. But could you please give me an example that multiplies two double values using pipelining in Delphi? I want to know how we can exploit the benefits of pipelining in Delphi, and I think the best way is a simple example. Also, why doesn't Delphi support pipelining internally?
  • Pipelining requires assembly, and I don't know any. Also, any example involving just 2 numbers will only reduce the performance, not increase it.
    Because pipelining only works when the same instruction is applied to a large number of values (often 1000+), it is not much use for "normal" Windows programs, only for real number-crunching applications. Pipelining also won't work well in multi-threaded environments, because the thread switches break up the pipeline.
  • Most of the time it's possible to trust fate and hope that it won't be interrupted. And if it is, it will only stall at that point, not later.
    It will always be faster to pipeline.

    I don't think it's possible to do any inline assembly in Delphi, but I think it's possible to link it.

    It isn't possible to pipeline one instruction, because pipelining only works on multiple values simultaneously.

    It's always good to pipeline, but it takes a lot of thought to do it really well, so in most cases it isn't worth it.
  • It is possible to place assembly code in Delphi code, although it is no longer possible to place true inline code (i.e. hexadecimal instructions) in the source as it was with Turbo Pascal (TP).
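
    As an illustration of assembly inside Delphi code, here is a rough sketch that multiplies two pairs of doubles with one SSE2 instruction. It rests on several assumptions that are not from this thread: a 32-bit Delphi compiler whose built-in assembler knows the SSE2 mnemonics, a CPU with SSE2, and the default register calling convention. TDoublePair and MulPairs are made-up names, and on other targets the sketch falls back to plain Pascal.

    [code]
    program AsmSketch;
    {$APPTYPE CONSOLE}

    type
      // Hypothetical type for this sketch: two doubles, 16 bytes in total.
      TDoublePair = array[0..1] of Double;

    {$IFDEF CPU386}
    // R[0] := A[0] * B[0] and R[1] := A[1] * B[1] in a single MULPD.
    procedure MulPairs(const A, B: TDoublePair; var R: TDoublePair);
    asm
      // 32-bit register calling convention: EAX = @A, EDX = @B, ECX = @R
      movupd xmm0, [eax]   // load both doubles of A (no alignment required)
      movupd xmm1, [edx]   // load both doubles of B
      mulpd  xmm0, xmm1    // two multiplications at once
      movupd [ecx], xmm0   // store both products in R
    end;
    {$ELSE}
    // Plain Pascal fallback for targets other than 32-bit x86.
    procedure MulPairs(const A, B: TDoublePair; var R: TDoublePair);
    begin
      R[0] := A[0] * B[0];
      R[1] := A[1] * B[1];
    end;
    {$ENDIF}

    var
      A, B, R: TDoublePair;
    begin
      A[0] := 1.5; A[1] := 2.0;
      B[0] := 4.0; B[1] := 0.5;
      MulPairs(A, B, R);
      WriteLn(R[0]:0:2, ' ', R[1]:0:2);  // 6.00 1.00
    end.
    [/code]

    For just two values the call overhead will outweigh any gain; the instruction only starts to pay off inside a loop over long arrays.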
  • Is it possible to use 'db' (define byte)?

    If it is, then it's possible to enter opcodes.
    Like this:
    db 0,0,0,0
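
    For what it is worth, Delphi's built-in assembler does accept the DB directive inside an asm block, so raw opcode bytes can be entered this way. A minimal sketch, assuming a 32-bit x86 Delphi compiler (FortyTwo is a made-up name; note that Delphi writes hexadecimal numbers with a leading $):

    [code]
    program DbSketch;
    {$APPTYPE CONSOLE}

    // The bytes B8 2A 00 00 00 are the x86 encoding of "mov eax, 42".
    // An Integer function result is returned in EAX, and the compiler
    // adds the RET at the end of the asm body.
    function FortyTwo: Integer;
    asm
      db $B8, $2A, $00, $00, $00
    end;

    begin
      WriteLn(FortyTwo);  // 42
    end.
    [/code]

    Bytes entered like this are executed as they are, so they have to form valid instructions for the target CPU.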