Multiplying big numbers using Long Multiplication

Three weeks ago I posted Multiplying big numbers using Karatsuba's method where I made reference to my version of the classical long multiplication. That's what I'm posting today so people can compare for themselves. At the same time I'm hoping someone still sees ways to further improve this code!



The multi-precision mpMul procedure that I present below requires 4 parameters passed on the stack.



  • The 1st param specifies the precision (length in bits) of both inputs, which must be a multiple of 64.

  • The 2nd param points to the 1st input aka multiplicand.

  • The 3rd param points to the 2nd input aka multiplier.

  • The 4th param points to where the double length result needs to go.

The address of the double length product is returned in the EAX register and the carry flag will be set if the result exceeds the precision.



I applied the following ideas:



  • I detect leading and trailing zeroes beforehand. All leading zeroes can simply be ignored. For each trailing zero in the multiplicand or multiplier, a new trailing zero is inserted in the result.


  • Since detecting embedded zeroes came at no apparent cost, appropriate shortcuts were installed.


  • I let the shorter number control the outer loop, thus reducing the number of times the costlier inner loop has to run.


  • I try to avoid having to enter the carry-loop as much as possible.
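For readers who want to check the arithmetic, here is a rough C model of the ideas above (my sketch with illustrative names, not the author's code): 32-bit limbs with limb 0 least significant, leading/trailing-zero pruning, the shorter operand driving the outer loop, and the embedded-zero shortcut. The return value mirrors the carry-flag contract: 1 when the product does not fit in the input precision.

```c
#include <stdint.h>
#include <assert.h>

/* Rough C model of the ideas above (illustrative names, not the
   author's code).  a and b each hold n 32-bit limbs, limb 0 least
   significant; res receives 2*n limbs.  Returns 1 when the product
   does not fit in n limbs -- the routine's carry-flag condition. */
static int mp_mul_model(const uint32_t *a, const uint32_t *b,
                        uint32_t *res, int n)
{
    for (int i = 0; i < 2 * n; i++) res[i] = 0;

    int na = n, nb = n;
    while (na > 0 && a[na - 1] == 0) na--;      /* ignore leading zeroes */
    while (nb > 0 && b[nb - 1] == 0) nb--;
    if (na == 0 || nb == 0) return 0;           /* an operand is zero */

    int shift = 0;                              /* trailing zeroes become */
    while (a[0] == 0) { a++; na--; shift++; }   /*  zero limbs at the     */
    while (b[0] == 0) { b++; nb--; shift++; }   /*  bottom of the result  */
    res += shift;

    const uint32_t *outer = b, *inner = a;      /* the shorter number     */
    int no = nb, ni = na;                       /*  drives the outer loop */
    if (na < nb) { outer = a; inner = b; no = na; ni = nb; }

    for (int j = 0; j < no; j++) {
        if (outer[j] == 0) continue;            /* embedded-zero shortcut */
        uint32_t carry = 0;
        for (int i = 0; i < ni; i++) {
            uint64_t t = (uint64_t)inner[i] * outer[j] + res[i + j] + carry;
            res[i + j] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
        res[j + ni] += carry;                   /* res[j+ni] is still 0 here */
    }

    res -= shift;
    for (int i = n; i < 2 * n; i++)             /* any upper limb nonzero */
        if (res[i] != 0) return 1;              /*  -> exceeds precision  */
    return 0;
}
```

For example, with n = 2 and both operands equal to 2^64-1, the model produces the 128-bit square 0xFFFFFFFFFFFFFFFE0000000000000001 and reports overflow.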


Computing MAX_UINT^2, the number of multiplications is $4^{n-5}$ where $n=\log_2(\text{Precision})$.
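That count can be sanity-checked: squaring MAX_UINT leaves no zero limbs to skip, so the multiply runs once per pair of 32-bit limbs, i.e. (Precision/32)^2 times, which for Precision = 2^n equals 4^(n-5). A minimal check (my own, for illustration):

```c
#include <stdint.h>
#include <assert.h>

/* Dword multiplications for a full (no zero limbs) multiply at the
   given precision: one per pair of 32-bit limbs. */
static uint64_t mul_count(uint64_t precision_bits)
{
    uint64_t dwords = precision_bits / 32;
    return dwords * dwords;   /* equals 4^(n-5) when precision = 2^n */
}
```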



[Figure: Computing MAX_UINT^2]



; -------------------------------------
; Multiplying using long multiplication
; -------------------------------------
; IN () OUT (eax,CF)
mpMul: pushad
mov ebp, esp
mov edx, [ebp+32+4] ;IntSize 64, 128, 192, ...
shr edx, 5 ;Bits -> dwords

mov edi, [ebp+32+4+12] ;BufferC (double length result)
mov [ebp+28], edi ;pushad.EAX
push edx edi ;(1)

imul eax, edx, 8
pxor xmm0, xmm0
.Wipe: sub eax, 16 ;Clear result buffer
movdqu [edi+eax], xmm0
jnz .Wipe

mov esi, [ebp+8+32+4+8] ;BufferB (default multiplier)
mov ebx, [ebp+8+32+4+4] ;BufferA (default multiplicand)
call .Prune ; -> ECX ESI EDI
xchg ebx, esi
mov ebp, ecx
call .Prune ; -> ECX ESI EDI

cmp ebp, ecx
jbe .MultiplierLoop ;Multiplier 'is shorter' Multiplicand
xchg esi, ebx
xchg ecx, ebp
; - - - - - - - - - - - - - - - - - -
.MultiplierLoop:
cmp dword [ebx], 0 ;Embedded zero in multiplier
je .Zero

push ecx esi edi ;(2)
.MultiplicandLoop:
add edi, 4
lods dword [esi]
test eax, eax ;Embedded zero in multiplicand
jz .Done
mul dword [ebx]
add [edi-4], eax
adc [edi], edx
jnc .Done
adc dword [edi+4], 0 ;(*) Avoid entering the carry-loop
jnc .Done
mov eax, edi
.Carry: add eax, 4 ;The carry-loop
add dword [eax+4], 1 ;(*) +4
jc .Carry
.Done: dec ecx
jnz .MultiplicandLoop
pop edi esi ecx ;(2)

.Zero: add edi, 4
add ebx, 4
dec ebp
jnz .MultiplierLoop
; - - - - - - - - - - - - - - - - - -
.END: pop edi ecx ;(1)
lea edi, [edi+ecx*4] ;Middle of double length result
xor eax, eax
repe scas dword [edi] ;Defines CF

popad
ret 16
; - - - - - - - - - - - - - - - - - -
; IN (eax=0,edx,esi,edi) OUT (ecx,esi,edi)
.Prune: mov ecx, edx ;Precision expressed as dwords (2+)
.Lead: dec ecx ;Skip leading zeroes
js .Quit ;Multiplier and/or Multiplicand is 0
cmp [esi+ecx*4], eax ;EAX=0
je .Lead
inc ecx ;Significant dwords
cmp [esi], eax ;EAX=0
jne .Ready
.Trail: dec ecx ;Skip trailing zeroes
add esi, 4 ;Skip low dwords that are zero
add edi, 4 ; by moving starting points upwards
cmp [esi], eax ;EAX=0
je .Trail
.Ready: ret
.Quit: pop eax ;Forget 'call .Prune'
jmp .END
; -----------------------------------
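In C terms, each pass of the inner loop performs the following accumulation (a model I wrote for illustration, not part of the routine): the add/adc pair adds the 64-bit partial product in place, and the (*)-marked carry-loop ripples any remaining carry upward. Because the double-length buffer was zeroed first, the ripple always finds a limb that does not overflow and terminates.

```c
#include <stdint.h>
#include <assert.h>

/* Model of one inner-loop step: add the 64-bit partial product
   hi:lo into res at limb position p, then propagate the carry,
   like the add/adc pair and the carry-loop above.  Assumes res
   has enough zeroed limbs above p for the ripple to stop. */
static void add_partial(uint32_t *res, int p, uint32_t lo, uint32_t hi)
{
    uint64_t t = (uint64_t)res[p] + lo;          /* add [edi-4], eax */
    res[p] = (uint32_t)t;
    t = (uint64_t)res[p + 1] + hi + (t >> 32);   /* adc [edi], edx */
    res[p + 1] = (uint32_t)t;
    uint32_t carry = (uint32_t)(t >> 32);
    for (int i = p + 2; carry != 0; i++) {       /* the carry-loop */
        t = (uint64_t)res[i] + carry;
        res[i] = (uint32_t)t;
        carry = (uint32_t)(t >> 32);
    }
}
```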






asked Jun 3 at 15:11 by Sep Roland

          2 Answers
          Some perf thoughts:




          • Using xchg reg,reg is more expensive than you might expect.


          • pushad isn't great either.

          • I'm not sure how useful it might be here, but you might also take a look at this. This has to do with the issues related to chains of adc.

          • And while people are dubious about repe scas, it probably isn't an issue here.

          • It might also be interesting to run this through IACA. Something as simple as reversing two instructions can make a measurable difference; it's hard for humans to write efficient asm by eye anymore.

          It might also be interesting to write this in x64. More registers, more bits per register. There's got to be some benefit there somewhere, and x64 is pretty much the standard these days.



          You were probably hoping for something a little more analytical about the algorithm itself, but this is what I've got.




          Edit1:



          A late-breaking thought. Since you already have a requirement that your inputs need to be a multiple of 64, how about adding another one that says your memory pointers must be 64-byte aligned? This allows you to replace movdqu with movdqa. Tiny, but still.



          Probably even tinier: I might experiment with changing imul eax, edx, 8 to either mov eax, edx ; shl eax, 3 or lea eax, [edx * 8]. From a "length of instructions" viewpoint, imul is the clear winner. From a "latency" perspective, perhaps less so. IACA might give you this answer. It's pretty good for answering questions like: Is add esi, 4 better for my code at this point than lea esi, [esi + 4]?





























          • All good thoughts. Too bad that all the interesting stuff like mulx, adcx, adox, and x64 registers isn't available on my computer. I certainly see how they would improve the code. Given that all the other suggestions pertain to the one-time prologue and epilogue code, I can't expect too much from applying them.
            – Sep Roland
            Jun 10 at 21:20






























          A. These EBP-relative displacements are wrong. They would have been correct with ESP-relative addressing.




          mov esi, [ebp+8+32+4+8] ;BufferB (default multiplier)
          mov ebx, [ebp+8+32+4+4] ;BufferA (default multiplicand)




          B. The Intel optimization manual mentions:




          All branch targets should be 16-byte aligned.




          You could apply this and see what it does for your code.




          C. I don't see any benefit from checking for embedded zeroes. I think those will be extremely rare.




          .MultiplicandLoop:
          add edi, 4
          lods dword [esi]
          test eax, eax ;Embedded zero in multiplicand
          jz .Done
          mul dword [ebx]



          If you remove the test and jz instructions from the code above, you should insert some other (useful) instruction in their place. Why? The hint is in what you said yourself:




          ...detecting embedded zeroes came at no apparent cost...




          Meaning the code between lods and mul got executed for free!



          .MultiplicandLoop:
          lods dword [esi]
          add edi, 4
          mul dword [ebx]


























          • Thanks for the correction! I made a last-minute change from [esp] to [ebp] and forgot those 2 displacements. Nice catch. Concerning C: I can report a 3% speed increase from applying that suggestion.
            – Sep Roland
            Jun 10 at 21:25










First answer: answered Jun 3 at 23:26 by David Wohlferd (edited Jun 4 at 0:30)

          1,2581314




          1,2581314











          • All good thoughts. Too bad that all the interesting stuff like mulx, adcx, adox, and x64 registers isn't available on my computer. I certainly see how they would improve the code. Given that all the other suggestions pertain to the one-time prologue and epilogue code, I can't expect too much from applying them.
            – Sep Roland
            Jun 10 at 21:20
          up vote
          1
          down vote













          A. These EBP-relative displacements are wrong. They would have been correct if used with ESP-relative addressing.




              mov esi, [ebp+8+32+4+8]  ;BufferB (default multiplier)
              mov ebx, [ebp+8+32+4+4]  ;BufferA (default multiplicand)




          B. The Intel optimization manual mentions:




          All branch targets should be 16-byte aligned.




          You could apply this and see what it does for your code.




          C. I don't see any benefit from checking for embedded zeroes. I think those will be extremely rare.




          .MultiplicandLoop:
              add edi, 4
              lods dword [esi]
              test eax, eax            ;Embedded zero in multiplicand
              jz .Done
              mul dword [ebx]



          If you remove the test and jz instructions from the code above, you should insert some other (useful) instruction in their place. Why? The hint is in what you said yourself:




          ...detecting embedded zeroes came at no apparent cost...




          Meaning the code between lods and mul got executed for free!



          .MultiplicandLoop:
              lods dword [esi]
              add edi, 4
              mul dword [ebx]
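          For readers comparing against higher-level code: below is a minimal Python sketch of the same schoolbook inner loop on 32-bit limbs (least-significant limb first), splitting each 64-bit intermediate product the way MUL splits it across EDX:EAX. The zero-limb shortcut mirrors the test/jz shortcut discussed above. Function and variable names are illustrative, not taken from the original listing.

```python
MASK32 = 0xFFFFFFFF

def mp_mul(a, b):
    """Schoolbook long multiplication on 32-bit limbs, least-significant first.
    Each 32x32 multiply yields a 64-bit product: the low half goes into the
    result limb (like EAX), the high half becomes the carry (like EDX)."""
    r = [0] * (len(a) + len(b))
    for j, bj in enumerate(b):
        if bj == 0:                    # zero-limb shortcut, like test/jz
            continue
        carry = 0
        for i, ai in enumerate(a):
            t = ai * bj + r[i + j] + carry   # mul + accumulate + carry in
            r[i + j] = t & MASK32            # low 32 bits (EAX)
            carry = t >> 32                  # high 32 bits (EDX)
        r[j + len(a)] = carry          # final carry limb of this row
    return r

# MAX_UINT^2, the worst case mentioned in the question:
print(mp_mul([0xFFFFFFFF], [0xFFFFFFFF]))   # [0x00000001, 0xFFFFFFFE]
```

Note that letting the shorter operand drive the outer loop (as the assembly does) falls out naturally here by passing it as `b`.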





          answered Jun 7 at 10:55 – Fifoernik
          Thanks for the correction! I made a last-minute change from [esp] to [ebp] and forgot those 2 displacements. Nice catch. Concerning C. I can report I got a 3% speed increase from applying that suggestion.
          – Sep Roland
          Jun 10 at 21:25



