Multiplying big numbers using Long Multiplication

Three weeks ago I posted Multiplying big numbers using Karatsuba's method where I made reference to my version of the classical long multiplication. That's what I'm posting today so people can compare for themselves. At the same time I'm hoping someone still sees ways to further improve this code!



The multi-precision mpMul procedure that I present below requires 4 parameters passed on the stack.



  • The 1st param specifies the precision (length in bits) of both inputs, which must be a multiple of 64.

  • The 2nd param points to the 1st input aka multiplicand.

  • The 3rd param points to the 2nd input aka multiplier.

  • The 4th param points to where the double length result needs to go.

The address of the double length product is returned in the EAX register and the carry flag will be set if the result exceeds the precision.



I applied the following ideas:



  • I detect leading and trailing zeroes beforehand. All leading zeroes can simply be ignored. For each trailing zero in the multiplicand or multiplier, a new trailing zero is inserted in the result.


  • Since detecting embedded zeroes came at no apparent cost, appropriate shortcuts were installed.


  • I let the shorter number control the outer loop, thus reducing the number of times the costlier inner loop has to run.


  • I try to avoid having to enter the carry-loop as much as possible.
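For readers who want to check the arithmetic, here is a rough C model of the ideas above (my sketch with illustrative names, not the author's code): 32-bit limbs with limb 0 least significant, leading/trailing-zero pruning, the shorter operand driving the outer loop, and the embedded-zero shortcut. The return value mirrors the carry-flag contract: 1 when the product does not fit in the input precision.

```c
#include <stdint.h>
#include <assert.h>

/* Rough C model of the ideas above (illustrative names, not the
   author's code).  a and b each hold n 32-bit limbs, limb 0 least
   significant; res receives 2*n limbs.  Returns 1 when the product
   does not fit in n limbs -- the routine's carry-flag condition. */
static int mp_mul_model(const uint32_t *a, const uint32_t *b,
                        uint32_t *res, int n)
{
    for (int i = 0; i < 2 * n; i++) res[i] = 0;

    int na = n, nb = n;
    while (na > 0 && a[na - 1] == 0) na--;      /* ignore leading zeroes */
    while (nb > 0 && b[nb - 1] == 0) nb--;
    if (na == 0 || nb == 0) return 0;           /* an operand is zero */

    int shift = 0;                              /* trailing zeroes become */
    while (a[0] == 0) { a++; na--; shift++; }   /*  zero limbs at the     */
    while (b[0] == 0) { b++; nb--; shift++; }   /*  bottom of the result  */
    res += shift;

    const uint32_t *outer = b, *inner = a;      /* the shorter number     */
    int no = nb, ni = na;                       /*  drives the outer loop */
    if (na < nb) { outer = a; inner = b; no = na; ni = nb; }

    for (int j = 0; j < no; j++) {
        if (outer[j] == 0) continue;            /* embedded-zero shortcut */
        uint32_t carry = 0;
        for (int i = 0; i < ni; i++) {
            uint64_t t = (uint64_t)inner[i] * outer[j] + res[i + j] + carry;
            res[i + j] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
        res[j + ni] += carry;                   /* res[j+ni] is still 0 here */
    }

    res -= shift;
    for (int i = n; i < 2 * n; i++)             /* any upper limb nonzero */
        if (res[i] != 0) return 1;              /*  -> exceeds precision  */
    return 0;
}
```

For example, with n = 2 and both operands equal to 2^64-1, the model produces the 128-bit square 0xFFFFFFFFFFFFFFFE0000000000000001 and reports overflow.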


Computing MAX_UINT^2, the number of multiplications is $4^{n-5}$ where $n=\log_2(\text{Precision})$.
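That count can be sanity-checked: squaring MAX_UINT leaves no zero limbs to skip, so the multiply runs once per pair of 32-bit limbs, i.e. (Precision/32)^2 times, which for Precision = 2^n equals 4^(n-5). A minimal check (my own, for illustration):

```c
#include <stdint.h>
#include <assert.h>

/* Dword multiplications for a full (no zero limbs) multiply at the
   given precision: one per pair of 32-bit limbs. */
static uint64_t mul_count(uint64_t precision_bits)
{
    uint64_t dwords = precision_bits / 32;
    return dwords * dwords;   /* equals 4^(n-5) when precision = 2^n */
}
```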



[Figure: Computing MAX_UINT^2]



; -------------------------------------
; Multiplying using long multiplication
; -------------------------------------
; IN () OUT (eax,CF)
mpMul: pushad
mov ebp, esp
mov edx, [ebp+32+4] ;IntSize 64, 128, 192, ...
shr edx, 5 ;Bits -> dwords

mov edi, [ebp+32+4+12] ;BufferC (double length result)
mov [ebp+28], edi ;pushad.EAX
push edx edi ;(1)

imul eax, edx, 8
pxor xmm0, xmm0
.Wipe: sub eax, 16 ;Clear result buffer
movdqu [edi+eax], xmm0
jnz .Wipe

mov esi, [ebp+8+32+4+8] ;BufferB (default multiplier)
mov ebx, [ebp+8+32+4+4] ;BufferA (default multiplicand)
call .Prune ; -> ECX ESI EDI
xchg ebx, esi
mov ebp, ecx
call .Prune ; -> ECX ESI EDI

cmp ebp, ecx
jbe .MultiplierLoop ;Multiplier 'is shorter' Multiplicand
xchg esi, ebx
xchg ecx, ebp
; - - - - - - - - - - - - - - - - - -
.MultiplierLoop:
cmp dword [ebx], 0 ;Embedded zero in multiplier
je .Zero

push ecx esi edi ;(2)
.MultiplicandLoop:
add edi, 4
lods dword [esi]
test eax, eax ;Embedded zero in multiplicand
jz .Done
mul dword [ebx]
add [edi-4], eax
adc [edi], edx
jnc .Done
adc dword [edi+4], 0 ;(*) Avoid entering the carry-loop
jnc .Done
mov eax, edi
.Carry: add eax, 4 ;The carry-loop
add dword [eax+4], 1 ;(*) +4
jc .Carry
.Done: dec ecx
jnz .MultiplicandLoop
pop edi esi ecx ;(2)

.Zero: add edi, 4
add ebx, 4
dec ebp
jnz .MultiplierLoop
; - - - - - - - - - - - - - - - - - -
.END: pop edi ecx ;(1)
lea edi, [edi+ecx*4] ;Middle of double length result
xor eax, eax
repe scas dword [edi] ;Defines CF

popad
ret 16
; - - - - - - - - - - - - - - - - - -
; IN (eax=0,edx,esi,edi) OUT (ecx,esi,edi)
.Prune: mov ecx, edx ;Precision expressed as dwords (2+)
.Lead: dec ecx ;Skip leading zeroes
js .Quit ;Multiplier and/or Multiplicand is 0
cmp [esi+ecx*4], eax ;EAX=0
je .Lead
inc ecx ;Significant dwords
cmp [esi], eax ;EAX=0
jne .Ready
.Trail: dec ecx ;Skip trailing zeroes
add esi, 4 ;Skip low dwords that are zero
add edi, 4 ; by moving starting points upwards
cmp [esi], eax ;EAX=0
je .Trail
.Ready: ret
.Quit: pop eax ;Forget 'call .Prune'
jmp .END
; -----------------------------------
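In C terms, each pass of the inner loop performs the following accumulation (a model I wrote for illustration, not part of the routine): the add/adc pair adds the 64-bit partial product in place, and the (*)-marked carry-loop ripples any remaining carry upward. Because the double-length buffer was zeroed first, the ripple always finds a limb that does not overflow and terminates.

```c
#include <stdint.h>
#include <assert.h>

/* Model of one inner-loop step: add the 64-bit partial product
   hi:lo into res at limb position p, then propagate the carry,
   like the add/adc pair and the carry-loop above.  Assumes res
   has enough zeroed limbs above p for the ripple to stop. */
static void add_partial(uint32_t *res, int p, uint32_t lo, uint32_t hi)
{
    uint64_t t = (uint64_t)res[p] + lo;          /* add [edi-4], eax */
    res[p] = (uint32_t)t;
    t = (uint64_t)res[p + 1] + hi + (t >> 32);   /* adc [edi], edx */
    res[p + 1] = (uint32_t)t;
    uint32_t carry = (uint32_t)(t >> 32);
    for (int i = p + 2; carry != 0; i++) {       /* the carry-loop */
        t = (uint64_t)res[i] + carry;
        res[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
}
```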






asked Jun 3 at 15:11 by Sep Roland

          2 Answers
          Some perf thoughts:




          • Using xchg reg,reg is more expensive than you might expect.


          • pushad isn't great either.

          • I'm not sure how useful it might be here, but you might also take a look at this. This has to do with the issues related to chains of adc.

          • And while people are dubious about repe scas, it probably isn't an issue here.

          • It might also be interesting to run this through IACA. Something as simple as reversing two instructions can make a measurable difference; it's hard for humans to write efficient asm by eye anymore.

          It might also be interesting to write this in x64. More registers, more bits per register. There's got to be some benefit there somewhere, and x64 is pretty much the standard these days.



          You were probably hoping for something a little more analytical about the algorithm itself, but this is what I've got.




          Edit1:



          A late-breaking thought. Since you already have a requirement that your inputs need to be a multiple of 64, how about adding another one that says your memory pointers must be 64-byte aligned? This allows you to replace movdqu with movdqa. Tiny, but still.



          Probably even tinier: I might experiment with changing imul eax, edx, 8 to either mov eax, edx ; shl eax, 3 or lea eax, [edx * 8]. From a "length of instructions" viewpoint, imul is the clear winner. From a "latency" perspective, perhaps less so. IACA might give you this answer. It's pretty good for answering questions like: Is add esi, 4 better for my code at this point than lea esi, [esi + 4]?





























          • All good thoughts. Too bad that all the interesting stuff like mulx, adcx, adox, and x64 registers isn't available on my computer. I certainly see how they would improve the code. Given that all the other suggestions pertain to the one-time prologue and epilogue code, I can't expect too much from applying them.
            – Sep Roland
            Jun 10 at 21:20






























          A. These EBP-relative displacements are wrong. They would have been correct with ESP-relative addressing.




          mov esi, [ebp+8+32+4+8] ;BufferB (default multiplier)
          mov ebx, [ebp+8+32+4+4] ;BufferA (default multiplicand)




          B. The Intel optimization manual mentions:




          All branch targets should be 16-byte aligned.




          You could apply this and see what it does for your code.




          C. I don't see any benefit from checking for embedded zeroes. I think those will be extremely rare.




          .MultiplicandLoop:
          add edi, 4
          lods dword [esi]
          test eax, eax ;Embedded zero in multiplicand
          jz .Done
          mul dword [ebx]



          If you remove the test and jz instructions from the code above, you should insert some other (useful) instruction in their place. Why? The hint is in what you said yourself:




          ...detecting embedded zeroes came at no apparent cost...




          Meaning the code between lods and mul got executed for free!



          .MultiplicandLoop:
          lods dword [esi]
          add edi, 4
          mul dword [ebx]


























          • Thanks for the correction! I made a last-minute change from [esp] to [ebp] and forgot those 2 displacements. Nice catch. Concerning C: I can report a 3% speed increase from applying that suggestion.
            – Sep Roland
            Jun 10 at 21:25










First answer: answered Jun 3 at 23:26 by David Wohlferd (edited Jun 4 at 0:30)

          1,2581314




          1,2581314











          • All good thoughts. Too bad that all the interesting stuff like mulx, adcx, adox, and x64 registers isn't available on my computer. I certainly see how they would improve the code. Given that all the other suggestions pertain to the one-time prologue and epilogue code, I can't expect too much from applying them.
            – Sep Roland
            Jun 10 at 21:20
          up vote
          1
          down vote













          A. These EBP-relative displacements are wrong. They would have been correct if used with ESP-relative addressing.




              mov esi, [ebp+8+32+4+8]  ;BufferB (default multiplier)
              mov ebx, [ebp+8+32+4+4]  ;BufferA (default multiplicand)




          B. The Intel optimization manual mentions:




          All branch targets should be 16-byte aligned.




          You could apply this and see what it does for your code.




          C. I don't see any benefit from checking for embedded zeroes. I think those will be extremely rare.




          .MultiplicandLoop:
              add edi, 4
              lods dword [esi]
              test eax, eax            ;Embedded zero in multiplicand
              jz .Done
              mul dword [ebx]



          If you remove the test and jz instructions from the code above, you should insert some other (useful) instruction in their place. Why? The hint is in what you said yourself:




          ...detecting embedded zeroes came at no apparent cost...




          Meaning the code between lods and mul got executed for free!



          .MultiplicandLoop:
              lods dword [esi]
              add edi, 4
              mul dword [ebx]
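          For readers comparing against higher-level code: below is a minimal Python sketch of the same schoolbook inner loop on 32-bit limbs (least-significant limb first), splitting each 64-bit intermediate product the way MUL splits it across EDX:EAX. The zero-limb shortcut mirrors the test/jz shortcut discussed above. Function and variable names are illustrative, not taken from the original listing.

```python
MASK32 = 0xFFFFFFFF

def mp_mul(a, b):
    """Schoolbook long multiplication on 32-bit limbs, least-significant first.
    Each 32x32 multiply yields a 64-bit product: the low half goes into the
    result limb (like EAX), the high half becomes the carry (like EDX)."""
    r = [0] * (len(a) + len(b))
    for j, bj in enumerate(b):
        if bj == 0:                    # zero-limb shortcut, like test/jz
            continue
        carry = 0
        for i, ai in enumerate(a):
            t = ai * bj + r[i + j] + carry   # mul + accumulate + carry in
            r[i + j] = t & MASK32            # low 32 bits (EAX)
            carry = t >> 32                  # high 32 bits (EDX)
        r[j + len(a)] = carry          # final carry limb of this row
    return r

# MAX_UINT^2, the worst case mentioned in the question:
print(mp_mul([0xFFFFFFFF], [0xFFFFFFFF]))   # [0x00000001, 0xFFFFFFFE]
```

Note that letting the shorter operand drive the outer loop (as the assembly does) falls out naturally here by passing it as `b`.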





          answered Jun 7 at 10:55 – Fifoernik
          Thanks for the correction! I made a last-minute change from [esp] to [ebp] and forgot those 2 displacements. Nice catch. Concerning C. I can report I got a 3% speed increase from applying that suggestion.
          – Sep Roland
          Jun 10 at 21:25



