Determine added, removed and kept items between two lists

Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
9
down vote
favorite
My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.
def diff(before, after):
"""Return diff between iterables of sortable items `before` and `after` as
tuple of three lists of items that have been added, removed and kept.
"""
before_iter = iter(sorted(before))
after_iter = iter(sorted(after))
add_items =
rem_items =
keep_items =
try:
new = old = None
while True:
if new == old:
if new is not None:
keep_items.append(new)
new = old = None
new = next(after_iter)
old = next(before_iter)
elif new < old:
add_items.append(new)
new = None
new = next(after_iter)
elif old < new:
rem_items.append(old)
old = None
old = next(before_iter)
except StopIteration:
if new is not None:
add_items.append(new)
if old is not None:
rem_items.append(old)
add_items += list(after_iter)
rem_items += list(before_iter)
return add_items, rem_items, keep_items
if __name__ == '__main__':
assert diff(, ) == (, , )
assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])
Main concerns:
The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?
The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (
add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)
Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:
- The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.
python algorithm python-3.x iterator
add a comment |Â
up vote
9
down vote
favorite
My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.
def diff(before, after):
"""Return diff between iterables of sortable items `before` and `after` as
tuple of three lists of items that have been added, removed and kept.
"""
before_iter = iter(sorted(before))
after_iter = iter(sorted(after))
add_items =
rem_items =
keep_items =
try:
new = old = None
while True:
if new == old:
if new is not None:
keep_items.append(new)
new = old = None
new = next(after_iter)
old = next(before_iter)
elif new < old:
add_items.append(new)
new = None
new = next(after_iter)
elif old < new:
rem_items.append(old)
old = None
old = next(before_iter)
except StopIteration:
if new is not None:
add_items.append(new)
if old is not None:
rem_items.append(old)
add_items += list(after_iter)
rem_items += list(before_iter)
return add_items, rem_items, keep_items
if __name__ == '__main__':
assert diff(, ) == (, , )
assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])
Main concerns:
The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?
The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (
add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)
Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:
- The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.
python algorithm python-3.x iterator
If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â Alex Hall
May 16 at 12:39
add a comment |Â
up vote
9
down vote
favorite
up vote
9
down vote
favorite
My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.
def diff(before, after):
"""Return diff between iterables of sortable items `before` and `after` as
tuple of three lists of items that have been added, removed and kept.
"""
before_iter = iter(sorted(before))
after_iter = iter(sorted(after))
add_items =
rem_items =
keep_items =
try:
new = old = None
while True:
if new == old:
if new is not None:
keep_items.append(new)
new = old = None
new = next(after_iter)
old = next(before_iter)
elif new < old:
add_items.append(new)
new = None
new = next(after_iter)
elif old < new:
rem_items.append(old)
old = None
old = next(before_iter)
except StopIteration:
if new is not None:
add_items.append(new)
if old is not None:
rem_items.append(old)
add_items += list(after_iter)
rem_items += list(before_iter)
return add_items, rem_items, keep_items
if __name__ == '__main__':
assert diff(, ) == (, , )
assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])
Main concerns:
The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?
The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (
add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)
Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:
- The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.
python algorithm python-3.x iterator
My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.
def diff(before, after):
"""Return diff between iterables of sortable items `before` and `after` as
tuple of three lists of items that have been added, removed and kept.
"""
before_iter = iter(sorted(before))
after_iter = iter(sorted(after))
add_items =
rem_items =
keep_items =
try:
new = old = None
while True:
if new == old:
if new is not None:
keep_items.append(new)
new = old = None
new = next(after_iter)
old = next(before_iter)
elif new < old:
add_items.append(new)
new = None
new = next(after_iter)
elif old < new:
rem_items.append(old)
old = None
old = next(before_iter)
except StopIteration:
if new is not None:
add_items.append(new)
if old is not None:
rem_items.append(old)
add_items += list(after_iter)
rem_items += list(before_iter)
return add_items, rem_items, keep_items
if __name__ == '__main__':
assert diff(, ) == (, , )
assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])
Main concerns:
The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?
The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (
add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)
Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:
- The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.
python algorithm python-3.x iterator
edited May 15 at 19:07
asked May 15 at 13:52
Arminius
1516
1516
If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â Alex Hall
May 16 at 12:39
add a comment |Â
If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â Alex Hall
May 16 at 12:39
If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â Alex Hall
May 16 at 12:39
If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â Alex Hall
May 16 at 12:39
add a comment |Â
2 Answers
2
active
oldest
votes
up vote
15
down vote
accepted
First, let's look at the naive implementation using sets:
def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)
This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?
Here are timings of your three example cases:
Input diff diff_set diff_counter
------------------------------ ------------------- ------------------- -------------------
, 1.72 õs ñ 29.9 ns 1.21 õs ñ 5.34 ns 9.91 õs ñ 91.4 ns
[1,2,3], [4,3,2] 3.16 õs ñ 183 ns 1.57 õs ñ 28.4 ns 12 õs ñ 180 ns
['foo','bar'], ['baz','bar'] 2.77 õs ñ 83.4 ns 1.56 õs ñ 37.7 ns 11.5 õs ñ 240 ns
The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.
import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]
Here also the set implementation wins:

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.
For reference, here the implementation of diff_counter I used:
from collections import Counter
def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]
2
Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â Arminius
May 15 at 14:36
IâÂÂm having a hard time accepting that you call code using sets the âÂÂnaive implementationâÂÂ, and I consequently donâÂÂt agree with your recommendation. DonâÂÂt try the âÂÂnaïveâ approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, â¦) approach. But donâÂÂt default to âÂÂnaïveâÂÂ. ThatâÂÂs premature pessimisation.
â Konrad Rudolph
May 15 at 15:16
@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â Graipher
May 15 at 15:34
1
As a drawback, I guess this works only with Hashable values.
â Josay
May 15 at 16:10
1
@Josay: True. But since the OP already acknowledged thatsets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, theCounterway would have the same problem.
â Graipher
May 15 at 16:13
 |Â
show 2 more comments
up vote
6
down vote
With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.
But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:
new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,
try:
while True:
â¦
However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.
An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:
>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)
As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.
Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):
>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]
But kept elements doesn't have a nice way to retrieve them:
>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]
2
Updated timings in my answer to include this function.
â Graipher
May 15 at 14:43
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
15
down vote
accepted
First, let's look at the naive implementation using sets:
def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)
This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?
Here are timings of your three example cases:
Input diff diff_set diff_counter
------------------------------ ------------------- ------------------- -------------------
, 1.72 õs ñ 29.9 ns 1.21 õs ñ 5.34 ns 9.91 õs ñ 91.4 ns
[1,2,3], [4,3,2] 3.16 õs ñ 183 ns 1.57 õs ñ 28.4 ns 12 õs ñ 180 ns
['foo','bar'], ['baz','bar'] 2.77 õs ñ 83.4 ns 1.56 õs ñ 37.7 ns 11.5 õs ñ 240 ns
The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.
import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]
Here also the set implementation wins:

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.
For reference, here the implementation of diff_counter I used:
from collections import Counter
def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]
2
Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â Arminius
May 15 at 14:36
IâÂÂm having a hard time accepting that you call code using sets the âÂÂnaive implementationâÂÂ, and I consequently donâÂÂt agree with your recommendation. DonâÂÂt try the âÂÂnaïveâ approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, â¦) approach. But donâÂÂt default to âÂÂnaïveâÂÂ. ThatâÂÂs premature pessimisation.
â Konrad Rudolph
May 15 at 15:16
@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â Graipher
May 15 at 15:34
1
As a drawback, I guess this works only with Hashable values.
â Josay
May 15 at 16:10
1
@Josay: True. But since the OP already acknowledged thatsets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, theCounterway would have the same problem.
â Graipher
May 15 at 16:13
 |Â
show 2 more comments
up vote
15
down vote
accepted
First, let's look at the naive implementation using sets:
def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)
This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?
Here are timings of your three example cases:
Input diff diff_set diff_counter
------------------------------ ------------------- ------------------- -------------------
, 1.72 õs ñ 29.9 ns 1.21 õs ñ 5.34 ns 9.91 õs ñ 91.4 ns
[1,2,3], [4,3,2] 3.16 õs ñ 183 ns 1.57 õs ñ 28.4 ns 12 õs ñ 180 ns
['foo','bar'], ['baz','bar'] 2.77 õs ñ 83.4 ns 1.56 õs ñ 37.7 ns 11.5 õs ñ 240 ns
The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.
import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]
Here also the set implementation wins:

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.
For reference, here the implementation of diff_counter I used:
from collections import Counter
def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]
2
Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â Arminius
May 15 at 14:36
IâÂÂm having a hard time accepting that you call code using sets the âÂÂnaive implementationâÂÂ, and I consequently donâÂÂt agree with your recommendation. DonâÂÂt try the âÂÂnaïveâ approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, â¦) approach. But donâÂÂt default to âÂÂnaïveâÂÂ. ThatâÂÂs premature pessimisation.
â Konrad Rudolph
May 15 at 15:16
@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â Graipher
May 15 at 15:34
1
As a drawback, I guess this works only with Hashable values.
â Josay
May 15 at 16:10
1
@Josay: True. But since the OP already acknowledged thatsets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, theCounterway would have the same problem.
â Graipher
May 15 at 16:13
 |Â
show 2 more comments
up vote
15
down vote
accepted
up vote
15
down vote
accepted
First, let's look at the naive implementation using sets:
def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)
This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?
Here are timings of your three example cases:
Input diff diff_set diff_counter
------------------------------ ------------------- ------------------- -------------------
, 1.72 õs ñ 29.9 ns 1.21 õs ñ 5.34 ns 9.91 õs ñ 91.4 ns
[1,2,3], [4,3,2] 3.16 õs ñ 183 ns 1.57 õs ñ 28.4 ns 12 õs ñ 180 ns
['foo','bar'], ['baz','bar'] 2.77 õs ñ 83.4 ns 1.56 õs ñ 37.7 ns 11.5 õs ñ 240 ns
The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.
import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]
Here also the set implementation wins:

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.
For reference, here the implementation of diff_counter I used:
from collections import Counter
def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]
First, let's look at the naive implementation using sets:
def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)
This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?
Here are timings of your three example cases:
Input diff diff_set diff_counter
------------------------------ ------------------- ------------------- -------------------
, 1.72 õs ñ 29.9 ns 1.21 õs ñ 5.34 ns 9.91 õs ñ 91.4 ns
[1,2,3], [4,3,2] 3.16 õs ñ 183 ns 1.57 õs ñ 28.4 ns 12 õs ñ 180 ns
['foo','bar'], ['baz','bar'] 2.77 õs ñ 83.4 ns 1.56 õs ñ 37.7 ns 11.5 õs ñ 240 ns
The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.
import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]
Here also the set implementation wins:

In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.
For reference, here the implementation of diff_counter I used:
from collections import Counter
def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]
edited May 15 at 16:19
answered May 15 at 14:25
Graipher
20.4k42981
20.4k42981
2
Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â Arminius
May 15 at 14:36
IâÂÂm having a hard time accepting that you call code using sets the âÂÂnaive implementationâÂÂ, and I consequently donâÂÂt agree with your recommendation. DonâÂÂt try the âÂÂnaïveâ approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, â¦) approach. But donâÂÂt default to âÂÂnaïveâÂÂ. ThatâÂÂs premature pessimisation.
â Konrad Rudolph
May 15 at 15:16
@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â Graipher
May 15 at 15:34
1
As a drawback, I guess this works only with Hashable values.
â Josay
May 15 at 16:10
1
@Josay: True. But since the OP already acknowledged thatsets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, theCounterway would have the same problem.
â Graipher
May 15 at 16:13
 |Â
show 2 more comments
2
Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â Arminius
May 15 at 14:36
IâÂÂm having a hard time accepting that you call code using sets the âÂÂnaive implementationâÂÂ, and I consequently donâÂÂt agree with your recommendation. DonâÂÂt try the âÂÂnaïveâ approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, â¦) approach. But donâÂÂt default to âÂÂnaïveâÂÂ. ThatâÂÂs premature pessimisation.
â Konrad Rudolph
May 15 at 15:16
@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â Graipher
May 15 at 15:34
1
As a drawback, I guess this works only with Hashable values.
â Josay
May 15 at 16:10
1
@Josay: True. But since the OP already acknowledged thatsets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, theCounterway would have the same problem.
â Graipher
May 15 at 16:13
2
2
Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â Arminius
May 15 at 14:36
Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
â Arminius
May 15 at 14:36
IâÂÂm having a hard time accepting that you call code using sets the âÂÂnaive implementationâÂÂ, and I consequently donâÂÂt agree with your recommendation. DonâÂÂt try the âÂÂnaïveâ approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, â¦) approach. But donâÂÂt default to âÂÂnaïveâÂÂ. ThatâÂÂs premature pessimisation.
â Konrad Rudolph
May 15 at 15:16
IâÂÂm having a hard time accepting that you call code using sets the âÂÂnaive implementationâÂÂ, and I consequently donâÂÂt agree with your recommendation. DonâÂÂt try the âÂÂnaïveâ approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, â¦) approach. But donâÂÂt default to âÂÂnaïveâÂÂ. ThatâÂÂs premature pessimisation.
â Konrad Rudolph
May 15 at 15:16
@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â Graipher
May 15 at 15:34
@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
â Graipher
May 15 at 15:34
1
1
As a drawback, I guess this works only with Hashable values.
â Josay
May 15 at 16:10
As a drawback, I guess this works only with Hashable values.
â Josay
May 15 at 16:10
1
1
@Josay: True. But since the OP already acknowledged that
sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.â Graipher
May 15 at 16:13
@Josay: True. But since the OP already acknowledged that
sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.â Graipher
May 15 at 16:13
 |Â
show 2 more comments
up vote
6
down vote
With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.
But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:
new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,
try:
while True:
â¦
However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.
An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:
>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)
As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.
Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):
>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]
But kept elements doesn't have a nice way to retrieve them:
>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]
2
Updated timings in my answer to include this function.
â Graipher
May 15 at 14:43
add a comment |Â
up vote
6
down vote
With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.
But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:
new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,
try:
while True:
â¦
However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.
An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:
>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)
As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.
Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):
>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]
But kept elements doesn't have a nice way to retrieve them:
>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]
2
Updated timings in my answer to include this function.
â Graipher
May 15 at 14:43
add a comment |Â
up vote
6
down vote
up vote
6
down vote
With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.
But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:
new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,
try:
while True:
â¦
However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.
An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:
>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)
As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.
Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):
>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]
But kept elements doesn't have a nice way to retrieve them:
>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]
With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.
But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:
new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,
try:
while True:
â¦
However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.
An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:
>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)
As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.
Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):
>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]
But kept elements doesn't have a nice way to retrieve them:
>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]
answered May 15 at 14:37
Mathias Ettinger
21.8k32875
21.8k32875
2
Updated timings in my answer to include this function.
â Graipher
May 15 at 14:43
add a comment |Â
2
Updated timings in my answer to include this function.
â Graipher
May 15 at 14:43
2
2
Updated timings in my answer to include this function.
â Graipher
May 15 at 14:43
Updated timings in my answer to include this function.
â Graipher
May 15 at 14:43
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f194459%2fdetermine-added-removed-and-kept-items-between-two-lists%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
â Alex Hall
May 16 at 12:39