Multiplying big numbers using Long Multiplication
Three weeks ago I posted Multiplying big numbers using Karatsuba's method where I made reference to my version of the classical long multiplication. That's what I'm posting today so people can compare for themselves. At the same time I'm hoping someone still sees ways to further improve this code!
The multi-precision mpMul procedure that I present below requires 4 parameters passed on the stack.
- The 1st param specifies the precision (length in bits) of both inputs, which needs to be a multiple of 64.
- The 2nd param points to the 1st input, aka the multiplicand.
- The 3rd param points to the 2nd input, aka the multiplier.
- The 4th param points to where the double-length result needs to go.
The address of the double-length product is returned in the EAX register, and the carry flag will be set if the result exceeds the precision.
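To make that last condition concrete, here is a minimal C model of the overflow test (the helper name product_overflows is mine, not part of the asm): the product exceeds the precision exactly when any dword in the upper half of the double-length result is nonzero. The sketch checks this with a plain loop where the asm uses a single repe scasd.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical C model of the final CF computation: with n dwords of
   precision per input, the product occupies r[0..2n-1] and "exceeds
   the precision" when its upper half r[n..2n-1] is not all zero. */
static int product_overflows(const uint32_t *r, size_t n)
{
    for (size_t i = n; i < 2 * n; i++)  /* scan the upper half */
        if (r[i] != 0)
            return 1;                   /* asm would set CF    */
    return 0;                           /* asm would clear CF  */
}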
I applied the following ideas:
- I detect leading and trailing zeroes beforehand. All leading zeroes can simply be ignored. For each trailing zero in the multiplicand or multiplier, a new trailing zero is inserted in the result.
- Since detecting embedded zeroes came at no apparent cost, appropriate shortcuts were installed.
- I let the shorter number control the outer loop, thus reducing the number of times the costlier inner loop has to run.
- I try to avoid having to enter the carry-loop as much as possible.
Computing MAX_UINT^2 (the worst case, where no zero dwords can be skipped), the number of multiplications is $4^{n-5}$ where $n=\log_2(\text{Precision})$.
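For reference, here is a short C sketch (my own, not from the post) of the same schoolbook method on 32-bit little-endian limbs with two equal-length inputs. It uses the textbook running-carry formulation instead of the asm's add/adc into memory plus carry-loop, and it keeps only the multiplier's embedded-zero shortcut; the leading/trailing-zero pruning and the shorter-operand-outside rule are omitted for brevity:

#include <stdint.h>
#include <stddef.h>

/* Schoolbook long multiplication: a[0..n-1] * b[0..n-1] -> r[0..2n-1].
   Limbs are 32-bit dwords, least significant first, like the asm buffers. */
static void longmul(const uint32_t *a, const uint32_t *b,
                    uint32_t *r, size_t n)
{
    for (size_t i = 0; i < 2 * n; i++)   /* wipe the result buffer */
        r[i] = 0;
    for (size_t j = 0; j < n; j++) {     /* outer loop: multiplier   */
        if (b[j] == 0) continue;         /* embedded-zero shortcut   */
        uint32_t carry = 0;
        for (size_t i = 0; i < n; i++) { /* inner loop: multiplicand */
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;         /* low dword of product  */
            carry    = (uint32_t)(t >> 32); /* high dword carries on */
        }
        r[j + n] = carry;                /* top slot is still zero here */
    }
}

Counting inner-loop bodies confirms the formula above: with d = Precision/32 dwords per input, the worst case costs d^2 multiplications, which equals $4^{n-5}$ for $n=\log_2(\text{Precision})$.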
; -------------------------------------
; Multiplying using long multiplication
; -------------------------------------
; IN () OUT (eax,CF)
mpMul: pushad
mov ebp, esp
mov edx, [ebp+32+4] ;IntSize 64, 128, 192, ...
shr edx, 5 ;Bits -> dwords
mov edi, [ebp+32+4+12] ;BufferC (double length result)
mov [ebp+28], edi ;pushad.EAX
push edx edi ;(1)
imul eax, edx, 8
pxor xmm0, xmm0
.Wipe: sub eax, 16 ;Clear result buffer
movdqu [edi+eax], xmm0
jnz .Wipe
mov esi, [ebp+8+32+4+8] ;BufferB (default multiplier)
mov ebx, [ebp+8+32+4+4] ;BufferA (default multiplicand)
call .Prune ; -> ECX ESI EDI
xchg ebx, esi
mov ebp, ecx
call .Prune ; -> ECX ESI EDI
cmp ebp, ecx
jbe .MultiplierLoop ;Multiplier is not longer than multiplicand
xchg esi, ebx
xchg ecx, ebp
; - - - - - - - - - - - - - - - - - -
.MultiplierLoop:
cmp dword [ebx], 0 ;Embedded zero in multiplier
je .Zero
push ecx esi edi ;(2)
.MultiplicandLoop:
add edi, 4
lods dword [esi]
test eax, eax ;Embedded zero in multiplicand
jz .Done
mul dword [ebx]
add [edi-4], eax
adc [edi], edx
jnc .Done
adc dword [edi+4], 0 ;(*) Avoid entering the carry-loop
jnc .Done
mov eax, edi
.Carry: add eax, 4 ;The carry-loop
add dword [eax+4], 1 ;(*) +4
jc .Carry
.Done: dec ecx
jnz .MultiplicandLoop
pop edi esi ecx ;(2)
.Zero: add edi, 4
add ebx, 4
dec ebp
jnz .MultiplierLoop
; - - - - - - - - - - - - - - - - - -
.END: pop edi ecx ;(1)
lea edi, [edi+ecx*4] ;Middle of double length result
xor eax, eax
repe scas dword [edi] ;Defines CF
popad
ret 16
; - - - - - - - - - - - - - - - - - -
; IN (eax=0,edx,esi,edi) OUT (ecx,esi,edi)
.Prune: mov ecx, edx ;Precision expressed as dwords (2+)
.Lead: dec ecx ;Skip leading zeroes
js .Quit ;Multiplier and/or Multiplicand is 0
cmp [esi+ecx*4], eax ;EAX=0
je .Lead
inc ecx ;Significant dwords
cmp [esi], eax ;EAX=0
jne .Ready
.Trail: dec ecx ;Skip trailing zeroes
add esi, 4 ;Skip low dwords that are zero
add edi, 4 ; by moving starting points upwards
cmp [esi], eax ;EAX=0
je .Trail
.Ready: ret
.Quit: pop eax ;Forget 'call .Prune'
jmp .END
; -----------------------------------
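As an aside for readers comparing with the C sketch above, here is a hedged C model of what .Prune accomplishes (the name prune and the signature are mine): it trims leading zero dwords by shrinking the count, and turns each trailing zero dword into a free trailing zero of the product by advancing both the operand pointer and the result pointer.

#include <stdint.h>
#include <stddef.h>

/* Model of .Prune: trims x[0..n-1] and returns the number of
   significant dwords, or 0 if x is entirely zero. *px and *pr are
   moved past trailing zero dwords, so the product automatically
   gains one trailing zero dword for each one skipped. */
static size_t prune(const uint32_t **px, uint32_t **pr, size_t n)
{
    const uint32_t *x = *px;
    while (n > 0 && x[n - 1] == 0)  /* drop leading (high) zeroes   */
        n--;
    if (n == 0)
        return 0;                   /* operand is 0: product is 0   */
    while (x[0] == 0) {             /* drop trailing (low) zeroes   */
        x++;                        /* start the operand higher...  */
        (*pr)++;                    /* ...and the result higher too */
        n--;
    }
    *px = x;
    return n;
}

A return value of 0 corresponds to the asm's .Quit path, which abandons the multiplication altogether because the product is known to be zero.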
Tags: performance, algorithm, assembly, x86
asked Jun 3 at 15:11 by Sep Roland
2 Answers
Some perf thoughts:
- Using xchg reg,reg is more expensive than you might expect.
- pushad isn't great either.
- I'm not sure how useful it might be here, but you might also take a look at this. It has to do with the issues related to chains of adc.
- And while people are dubious about repe scas, it probably isn't an issue here.
- It might also be interesting to run this through IACA. Something as simple as reversing two instructions can make a measurable difference. It's hard for humans to write efficient asm anymore. Apologies if you are actually an AI. That's not as unlikely today as it once was.
It might also be interesting to write this in x64. More registers, more bits per register. There's got to be some benefit there somewhere, and x64 is pretty much the standard these days.
You were probably hoping for something a little more analytical about the algorithm itself, but this is what I've got.
Edit1:
A late-breaking thought. Since you already have a requirement that your inputs need to be a multiple of 64, how about adding another one that says your memory pointers must be 64-byte aligned? This allows you to replace movdqu with movdqa. Tiny, but still.
Probably even tinier: I might experiment with changing imul eax, edx, 8 to either mov eax, edx + shl eax, 3 or lea eax, [edx * 8]. From a "length of instructions" viewpoint, imul is the clear winner. From a "latency" perspective, perhaps less so. IACA might give you this answer. It's pretty good for answering questions like: is add esi, 4 better for my code at this point than lea esi, [esi + 4]?

edited Jun 4 at 0:30, answered Jun 3 at 23:26 by David Wohlferd
All good thoughts. Too bad that all the interesting stuff like mulx, adcx, adox, and x64 registers isn't available on my computer. I certainly see how they would improve the code. Given that all the other suggestions pertain to the one-time prologue and epilogue code, I can't expect too much from applying them. – Sep Roland, Jun 10 at 21:20
A. These EBP-relative displacements are wrong. They would have been correct with ESP-relative addressing.
mov esi, [ebp+8+32+4+8] ;BufferB (default multiplier)
mov ebx, [ebp+8+32+4+4] ;BufferA (default multiplicand)
The extra +8 compensates for the earlier push edx edi, but that push moved ESP, not EBP, so these should read [ebp+32+4+8] and [ebp+32+4+4].
B. The Intel optimization manual mentions:
All branch targets should be 16-byte aligned.
You could apply this and see what it does for your code.
C. I don't see any benefit from checking for embedded zeroes. I think those will be extremely rare.
.MultiplicandLoop:
add edi, 4
lods dword [esi]
test eax, eax ;Embedded zero in multiplicand
jz .Done
mul dword [ebx]
If you remove the test and jz instructions from the above code, you should insert some other (useful) instruction in their place. Why? The hint is in what you said yourself:
...detecting embedded zeroes came at no apparent cost...
Meaning the code between lods and mul got executed for free! So move the add edi, 4 into that free slot:
.MultiplicandLoop:
lods dword [esi]
add edi, 4
mul dword [ebx]
answered Jun 7 at 10:55 by Fifoernik

Thanks for the correction! I made a last-minute change from [esp] to [ebp] and forgot those 2 displacements. Nice catch. Concerning C: I can report I got a 3% speed increase from applying that suggestion. – Sep Roland, Jun 10 at 21:25