Determine added, removed and kept items between two lists

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
9
down vote

favorite

My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.

def diff(before, after):
 """Return diff between iterables of sortable items `before` and `after` as
 tuple of three lists of items that have been added, removed and kept.
 """
 before_iter = iter(sorted(before))
 after_iter = iter(sorted(after))
 add_items = 
 rem_items = 
 keep_items = 
 try:
 new = old = None
 while True:
 if new == old:
 if new is not None:
 keep_items.append(new)
 new = old = None
 new = next(after_iter)
 old = next(before_iter)
 elif new < old:
 add_items.append(new)
 new = None
 new = next(after_iter)
 elif old < new:
 rem_items.append(old)
 old = None
 old = next(before_iter)
 except StopIteration:
 if new is not None:
 add_items.append(new)
 if old is not None:
 rem_items.append(old)
 add_items += list(after_iter)
 rem_items += list(before_iter)
 return add_items, rem_items, keep_items

if __name__ == '__main__':
 assert diff(, ) == (, , )
 assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
 assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])

Main concerns:

The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?

The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)

Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:

The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.

edited May 15 at 19:07

asked May 15 at 13:52

Arminius

1516

If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â€“Â Alex Hall
May 16 at 12:39

add a commentÂ |Â

up vote
9
down vote

favorite

My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.

def diff(before, after):
 """Return diff between iterables of sortable items `before` and `after` as
 tuple of three lists of items that have been added, removed and kept.
 """
 before_iter = iter(sorted(before))
 after_iter = iter(sorted(after))
 add_items = 
 rem_items = 
 keep_items = 
 try:
 new = old = None
 while True:
 if new == old:
 if new is not None:
 keep_items.append(new)
 new = old = None
 new = next(after_iter)
 old = next(before_iter)
 elif new < old:
 add_items.append(new)
 new = None
 new = next(after_iter)
 elif old < new:
 rem_items.append(old)
 old = None
 old = next(before_iter)
 except StopIteration:
 if new is not None:
 add_items.append(new)
 if old is not None:
 rem_items.append(old)
 add_items += list(after_iter)
 rem_items += list(before_iter)
 return add_items, rem_items, keep_items

if __name__ == '__main__':
 assert diff(, ) == (, , )
 assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
 assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])

Main concerns:

The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?

The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)

Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:

The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.

edited May 15 at 19:07

asked May 15 at 13:52

Arminius

1516

If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â€“Â Alex Hall
May 16 at 12:39

add a commentÂ |Â

up vote
9
down vote

favorite

My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.

def diff(before, after):
 """Return diff between iterables of sortable items `before` and `after` as
 tuple of three lists of items that have been added, removed and kept.
 """
 before_iter = iter(sorted(before))
 after_iter = iter(sorted(after))
 add_items = 
 rem_items = 
 keep_items = 
 try:
 new = old = None
 while True:
 if new == old:
 if new is not None:
 keep_items.append(new)
 new = old = None
 new = next(after_iter)
 old = next(before_iter)
 elif new < old:
 add_items.append(new)
 new = None
 new = next(after_iter)
 elif old < new:
 rem_items.append(old)
 old = None
 old = next(before_iter)
 except StopIteration:
 if new is not None:
 add_items.append(new)
 if old is not None:
 rem_items.append(old)
 add_items += list(after_iter)
 rem_items += list(before_iter)
 return add_items, rem_items, keep_items

if __name__ == '__main__':
 assert diff(, ) == (, , )
 assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
 assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])

Main concerns:

The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?

The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)

Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:

The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.

edited May 15 at 19:07

asked May 15 at 13:52

Arminius

1516

My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.

def diff(before, after):
 """Return diff between iterables of sortable items `before` and `after` as
 tuple of three lists of items that have been added, removed and kept.
 """
 before_iter = iter(sorted(before))
 after_iter = iter(sorted(after))
 add_items = 
 rem_items = 
 keep_items = 
 try:
 new = old = None
 while True:
 if new == old:
 if new is not None:
 keep_items.append(new)
 new = old = None
 new = next(after_iter)
 old = next(before_iter)
 elif new < old:
 add_items.append(new)
 new = None
 new = next(after_iter)
 elif old < new:
 rem_items.append(old)
 old = None
 old = next(before_iter)
 except StopIteration:
 if new is not None:
 add_items.append(new)
 if old is not None:
 rem_items.append(old)
 add_items += list(after_iter)
 rem_items += list(before_iter)
 return add_items, rem_items, keep_items

if __name__ == '__main__':
 assert diff(, ) == (, , )
 assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
 assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])

Main concerns:

The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?

The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)

Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:

The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.

edited May 15 at 19:07

asked May 15 at 13:52

Arminius

1516

edited May 15 at 19:07

asked May 15 at 13:52

Arminius

1516

asked May 15 at 13:52

Arminius

1516

asked May 15 at 13:52

Arminius

1516

If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â€“Â Alex Hall
May 16 at 12:39

add a commentÂ |Â

If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â€“Â Alex Hall
May 16 at 12:39

If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â€“Â Alex Hall
May 16 at 12:39

add a commentÂ |Â

2 Answers
2

active

oldest

votes

up vote
15
down vote

accepted

First, let's look at the naive implementation using sets:

def diff_set(before, after):
 b, a = set(before), set(after)
 return list(a - b), list(b - a), list(a & b)

This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?

Here are timings of your three example cases:

 Input diff diff_set diff_counter 
 ------------------------------ ------------------- ------------------- ------------------- 
 , 1.72 Ã‚Âµs Ã‚Â± 29.9 ns 1.21 Ã‚Âµs Ã‚Â± 5.34 ns 9.91 Ã‚Âµs Ã‚Â± 91.4 ns 
 [1,2,3], [4,3,2] 3.16 Ã‚Âµs Ã‚Â± 183 ns 1.57 Ã‚Âµs Ã‚Â± 28.4 ns 12 Ã‚Âµs Ã‚Â± 180 ns 
 ['foo','bar'], ['baz','bar'] 2.77 Ã‚Âµs Ã‚Â± 83.4 ns 1.56 Ã‚Âµs Ã‚Â± 37.7 ns 11.5 Ã‚Âµs Ã‚Â± 240 ns

The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.

import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]

Here also the set implementation wins:

enter image description here

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.

For reference, here the implementation of diff_counter I used:

from collections import Counter

def diff_counter(before, after):
 c = Counter(after)
 c.subtract(before)
 return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]

edited May 15 at 16:19

answered May 15 at 14:25

Graipher

20.4k42981

2

Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â€“Â Arminius
May 15 at 14:36

IÃ¢Â€Â™m having a hard time accepting that you call code using sets the Ã¢Â€Âœnaive implementationÃ¢Â€Â, and I consequently donÃ¢Â€Â™t agree with your recommendation. DonÃ¢Â€Â™t try the Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, Ã¢Â€Â¦) approach. But donÃ¢Â€Â™t default to Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â. ThatÃ¢Â€Â™s premature pessimisation.
â€“Â Konrad Rudolph
May 15 at 15:16

@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â€“Â Graipher
May 15 at 15:34

1

As a drawback, I guess this works only with Hashable values.
â€“Â Josay
May 15 at 16:10

1

@Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
â€“Â Graipher
May 15 at 16:13

Â |Â
show 2 more comments

up vote
6
down vote

With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.

But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:

new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
 return , , 
elif new is None:
 return , before, 
elif old is None:
 return after, , 

try:
 while True:
 Ã¢Â€Â¦

However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.

An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:

>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)

As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.

Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):

>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]

But kept elements doesn't have a nice way to retrieve them:

>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]

answered May 15 at 14:37

Mathias Ettinger

21.8k32875

2

Updated timings in my answer to include this function.
â€“Â Graipher
May 15 at 14:43

add a commentÂ |Â

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f194459%2fdetermine-added-removed-and-kept-items-between-two-lists%23new-answer', 'question_page');

);

Post as a guest

Name

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
15
down vote

accepted

First, let's look at the naive implementation using sets:

def diff_set(before, after):
 b, a = set(before), set(after)
 return list(a - b), list(b - a), list(a & b)

This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?

Here are timings of your three example cases:

 Input diff diff_set diff_counter 
 ------------------------------ ------------------- ------------------- ------------------- 
 , 1.72 Ã‚Âµs Ã‚Â± 29.9 ns 1.21 Ã‚Âµs Ã‚Â± 5.34 ns 9.91 Ã‚Âµs Ã‚Â± 91.4 ns 
 [1,2,3], [4,3,2] 3.16 Ã‚Âµs Ã‚Â± 183 ns 1.57 Ã‚Âµs Ã‚Â± 28.4 ns 12 Ã‚Âµs Ã‚Â± 180 ns 
 ['foo','bar'], ['baz','bar'] 2.77 Ã‚Âµs Ã‚Â± 83.4 ns 1.56 Ã‚Âµs Ã‚Â± 37.7 ns 11.5 Ã‚Âµs Ã‚Â± 240 ns

import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]

Here also the set implementation wins:

enter image description here

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.

For reference, here the implementation of diff_counter I used:

from collections import Counter

def diff_counter(before, after):
 c = Counter(after)
 c.subtract(before)
 return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]

edited May 15 at 16:19

answered May 15 at 14:25

Graipher

20.4k42981

2

Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â€“Â Arminius
May 15 at 14:36

IÃ¢Â€Â™m having a hard time accepting that you call code using sets the Ã¢Â€Âœnaive implementationÃ¢Â€Â, and I consequently donÃ¢Â€Â™t agree with your recommendation. DonÃ¢Â€Â™t try the Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, Ã¢Â€Â¦) approach. But donÃ¢Â€Â™t default to Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â. ThatÃ¢Â€Â™s premature pessimisation.
â€“Â Konrad Rudolph
May 15 at 15:16

@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â€“Â Graipher
May 15 at 15:34

1

As a drawback, I guess this works only with Hashable values.
â€“Â Josay
May 15 at 16:10

1

@Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
â€“Â Graipher
May 15 at 16:13

Â |Â
show 2 more comments

up vote
15
down vote

accepted

First, let's look at the naive implementation using sets:

def diff_set(before, after):
 b, a = set(before), set(after)
 return list(a - b), list(b - a), list(a & b)

This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?

Here are timings of your three example cases:

 Input diff diff_set diff_counter 
 ------------------------------ ------------------- ------------------- ------------------- 
 , 1.72 Ã‚Âµs Ã‚Â± 29.9 ns 1.21 Ã‚Âµs Ã‚Â± 5.34 ns 9.91 Ã‚Âµs Ã‚Â± 91.4 ns 
 [1,2,3], [4,3,2] 3.16 Ã‚Âµs Ã‚Â± 183 ns 1.57 Ã‚Âµs Ã‚Â± 28.4 ns 12 Ã‚Âµs Ã‚Â± 180 ns 
 ['foo','bar'], ['baz','bar'] 2.77 Ã‚Âµs Ã‚Â± 83.4 ns 1.56 Ã‚Âµs Ã‚Â± 37.7 ns 11.5 Ã‚Âµs Ã‚Â± 240 ns

import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]

Here also the set implementation wins:

enter image description here

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.

For reference, here the implementation of diff_counter I used:

from collections import Counter

def diff_counter(before, after):
 c = Counter(after)
 c.subtract(before)
 return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]

edited May 15 at 16:19

answered May 15 at 14:25

Graipher

20.4k42981

2

Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â€“Â Arminius
May 15 at 14:36

IÃ¢Â€Â™m having a hard time accepting that you call code using sets the Ã¢Â€Âœnaive implementationÃ¢Â€Â, and I consequently donÃ¢Â€Â™t agree with your recommendation. DonÃ¢Â€Â™t try the Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, Ã¢Â€Â¦) approach. But donÃ¢Â€Â™t default to Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â. ThatÃ¢Â€Â™s premature pessimisation.
â€“Â Konrad Rudolph
May 15 at 15:16

@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â€“Â Graipher
May 15 at 15:34

1

As a drawback, I guess this works only with Hashable values.
â€“Â Josay
May 15 at 16:10

1

@Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
â€“Â Graipher
May 15 at 16:13

Â |Â
show 2 more comments

up vote
15
down vote

accepted

First, let's look at the naive implementation using sets:

def diff_set(before, after):
 b, a = set(before), set(after)
 return list(a - b), list(b - a), list(a & b)

This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?

Here are timings of your three example cases:

 Input diff diff_set diff_counter 
 ------------------------------ ------------------- ------------------- ------------------- 
 , 1.72 Ã‚Âµs Ã‚Â± 29.9 ns 1.21 Ã‚Âµs Ã‚Â± 5.34 ns 9.91 Ã‚Âµs Ã‚Â± 91.4 ns 
 [1,2,3], [4,3,2] 3.16 Ã‚Âµs Ã‚Â± 183 ns 1.57 Ã‚Âµs Ã‚Â± 28.4 ns 12 Ã‚Âµs Ã‚Â± 180 ns 
 ['foo','bar'], ['baz','bar'] 2.77 Ã‚Âµs Ã‚Â± 83.4 ns 1.56 Ã‚Âµs Ã‚Â± 37.7 ns 11.5 Ã‚Âµs Ã‚Â± 240 ns

import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]

Here also the set implementation wins:

enter image description here

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.

For reference, here the implementation of diff_counter I used:

from collections import Counter

def diff_counter(before, after):
 c = Counter(after)
 c.subtract(before)
 return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]

edited May 15 at 16:19

answered May 15 at 14:25

Graipher

20.4k42981

First, let's look at the naive implementation using sets:

def diff_set(before, after):
 b, a = set(before), set(after)
 return list(a - b), list(b - a), list(a & b)

This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?

Here are timings of your three example cases:

 Input diff diff_set diff_counter 
 ------------------------------ ------------------- ------------------- ------------------- 
 , 1.72 Ã‚Âµs Ã‚Â± 29.9 ns 1.21 Ã‚Âµs Ã‚Â± 5.34 ns 9.91 Ã‚Âµs Ã‚Â± 91.4 ns 
 [1,2,3], [4,3,2] 3.16 Ã‚Âµs Ã‚Â± 183 ns 1.57 Ã‚Âµs Ã‚Â± 28.4 ns 12 Ã‚Âµs Ã‚Â± 180 ns 
 ['foo','bar'], ['baz','bar'] 2.77 Ã‚Âµs Ã‚Â± 83.4 ns 1.56 Ã‚Âµs Ã‚Â± 37.7 ns 11.5 Ã‚Âµs Ã‚Â± 240 ns

import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]

Here also the set implementation wins:

enter image description here

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.

For reference, here the implementation of diff_counter I used:

from collections import Counter

def diff_counter(before, after):
 c = Counter(after)
 c.subtract(before)
 return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]

edited May 15 at 16:19

answered May 15 at 14:25

Graipher

20.4k42981

edited May 15 at 16:19

answered May 15 at 14:25

Graipher

20.4k42981

answered May 15 at 14:25

Graipher

20.4k42981

answered May 15 at 14:25

Graipher

20.4k42981

2

Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â€“Â Arminius
May 15 at 14:36

IÃ¢Â€Â™m having a hard time accepting that you call code using sets the Ã¢Â€Âœnaive implementationÃ¢Â€Â, and I consequently donÃ¢Â€Â™t agree with your recommendation. DonÃ¢Â€Â™t try the Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, Ã¢Â€Â¦) approach. But donÃ¢Â€Â™t default to Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â. ThatÃ¢Â€Â™s premature pessimisation.
â€“Â Konrad Rudolph
May 15 at 15:16

@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â€“Â Graipher
May 15 at 15:34

1

As a drawback, I guess this works only with Hashable values.
â€“Â Josay
May 15 at 16:10

1

@Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
â€“Â Graipher
May 15 at 16:13

Â |Â
show 2 more comments

2

Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â€“Â Arminius
May 15 at 14:36

IÃ¢Â€Â™m having a hard time accepting that you call code using sets the Ã¢Â€Âœnaive implementationÃ¢Â€Â, and I consequently donÃ¢Â€Â™t agree with your recommendation. DonÃ¢Â€Â™t try the Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, Ã¢Â€Â¦) approach. But donÃ¢Â€Â™t default to Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â. ThatÃ¢Â€Â™s premature pessimisation.
â€“Â Konrad Rudolph
May 15 at 15:16

@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â€“Â Graipher
May 15 at 15:34

1

As a drawback, I guess this works only with Hashable values.
â€“Â Josay
May 15 at 16:10

1

@Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
â€“Â Graipher
May 15 at 16:13

Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â€“Â Arminius
May 15 at 14:36

IÃ¢Â€Â™m having a hard time accepting that you call code using sets the Ã¢Â€Âœnaive implementationÃ¢Â€Â, and I consequently donÃ¢Â€Â™t agree with your recommendation. DonÃ¢Â€Â™t try the Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, Ã¢Â€Â¦) approach. But donÃ¢Â€Â™t default to Ã¢Â€ÂœnaÃ¯veÃ¢Â€Â. ThatÃ¢Â€Â™s premature pessimisation.
â€“Â Konrad Rudolph
May 15 at 15:16

@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â€“Â Graipher
May 15 at 15:34

As a drawback, I guess this works only with Hashable values.
â€“Â Josay
May 15 at 16:10

@Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
â€“Â Graipher
May 15 at 16:13

Â |Â
show 2 more comments

up vote
6
down vote

But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:

new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
 return , , 
elif new is None:
 return , before, 
elif old is None:
 return after, , 

try:
 while True:
 Ã¢Â€Â¦

An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:

>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)

As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.

Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):

>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]

But kept elements doesn't have a nice way to retrieve them:

>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]

answered May 15 at 14:37

Mathias Ettinger

21.8k32875

2

Updated timings in my answer to include this function.
â€“Â Graipher
May 15 at 14:43

add a commentÂ |Â

up vote
6
down vote

But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:

new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
 return , , 
elif new is None:
 return , before, 
elif old is None:
 return after, , 

try:
 while True:
 Ã¢Â€Â¦

An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:

>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)

As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.

Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):

>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]

But kept elements doesn't have a nice way to retrieve them:

>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]

answered May 15 at 14:37

Mathias Ettinger

21.8k32875

2

Updated timings in my answer to include this function.
â€“Â Graipher
May 15 at 14:43

add a commentÂ |Â

up vote
6
down vote

But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:

new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
 return , , 
elif new is None:
 return , before, 
elif old is None:
 return after, , 

try:
 while True:
 Ã¢Â€Â¦

An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:

>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)

As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.

Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):

>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]

But kept elements doesn't have a nice way to retrieve them:

>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]

answered May 15 at 14:37

Mathias Ettinger

21.8k32875

But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:

new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
 return , , 
elif new is None:
 return , before, 
elif old is None:
 return after, , 

try:
 while True:
 Ã¢Â€Â¦

An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:

>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)

As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.

Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):

>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]

But kept elements doesn't have a nice way to retrieve them:

>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]

answered May 15 at 14:37

Mathias Ettinger

21.8k32875

answered May 15 at 14:37

Mathias Ettinger

21.8k32875

answered May 15 at 14:37

Mathias Ettinger

21.8k32875

answered May 15 at 14:37

Mathias Ettinger

21.8k32875

2

Updated timings in my answer to include this function.
â€“Â Graipher
May 15 at 14:43

add a commentÂ |Â

2

Updated timings in my answer to include this function.
â€“Â Graipher
May 15 at 14:43

Updated timings in my answer to include this function.
â€“Â Graipher
May 15 at 14:43

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

trjhtr