Determine added, removed and kept items between two lists

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
9
down vote

favorite












My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.



def diff(before, after):
"""Return diff between iterables of sortable items `before` and `after` as
tuple of three lists of items that have been added, removed and kept.
"""
before_iter = iter(sorted(before))
after_iter = iter(sorted(after))
add_items =
rem_items =
keep_items =
try:
new = old = None
while True:
if new == old:
if new is not None:
keep_items.append(new)
new = old = None
new = next(after_iter)
old = next(before_iter)
elif new < old:
add_items.append(new)
new = None
new = next(after_iter)
elif old < new:
rem_items.append(old)
old = None
old = next(before_iter)
except StopIteration:
if new is not None:
add_items.append(new)
if old is not None:
rem_items.append(old)
add_items += list(after_iter)
rem_items += list(before_iter)
return add_items, rem_items, keep_items

if __name__ == '__main__':
assert diff(, ) == (, , )
assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])


Main concerns:



  • The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?


  • The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)



Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:



  • The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.






share|improve this question





















  • If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
    – Alex Hall
    May 16 at 12:39
















up vote
9
down vote

favorite












My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.



def diff(before, after):
"""Return diff between iterables of sortable items `before` and `after` as
tuple of three lists of items that have been added, removed and kept.
"""
before_iter = iter(sorted(before))
after_iter = iter(sorted(after))
add_items =
rem_items =
keep_items =
try:
new = old = None
while True:
if new == old:
if new is not None:
keep_items.append(new)
new = old = None
new = next(after_iter)
old = next(before_iter)
elif new < old:
add_items.append(new)
new = None
new = next(after_iter)
elif old < new:
rem_items.append(old)
old = None
old = next(before_iter)
except StopIteration:
if new is not None:
add_items.append(new)
if old is not None:
rem_items.append(old)
add_items += list(after_iter)
rem_items += list(before_iter)
return add_items, rem_items, keep_items

if __name__ == '__main__':
assert diff(, ) == (, , )
assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])


Main concerns:



  • The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?


  • The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)



Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:



  • The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.






share|improve this question





















  • If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
    – Alex Hall
    May 16 at 12:39












up vote
9
down vote

favorite









up vote
9
down vote

favorite











My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.



def diff(before, after):
"""Return diff between iterables of sortable items `before` and `after` as
tuple of three lists of items that have been added, removed and kept.
"""
before_iter = iter(sorted(before))
after_iter = iter(sorted(after))
add_items =
rem_items =
keep_items =
try:
new = old = None
while True:
if new == old:
if new is not None:
keep_items.append(new)
new = old = None
new = next(after_iter)
old = next(before_iter)
elif new < old:
add_items.append(new)
new = None
new = next(after_iter)
elif old < new:
rem_items.append(old)
old = None
old = next(before_iter)
except StopIteration:
if new is not None:
add_items.append(new)
if old is not None:
rem_items.append(old)
add_items += list(after_iter)
rem_items += list(before_iter)
return add_items, rem_items, keep_items

if __name__ == '__main__':
assert diff(, ) == (, , )
assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])


Main concerns:



  • The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?


  • The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)



Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:



  • The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.






share|improve this question













My code is determining the difference between two iterables in terms of items that have been addded, removed and kept. The items are assumed to be sortable and there are no duplicates.



def diff(before, after):
"""Return diff between iterables of sortable items `before` and `after` as
tuple of three lists of items that have been added, removed and kept.
"""
before_iter = iter(sorted(before))
after_iter = iter(sorted(after))
add_items =
rem_items =
keep_items =
try:
new = old = None
while True:
if new == old:
if new is not None:
keep_items.append(new)
new = old = None
new = next(after_iter)
old = next(before_iter)
elif new < old:
add_items.append(new)
new = None
new = next(after_iter)
elif old < new:
rem_items.append(old)
old = None
old = next(before_iter)
except StopIteration:
if new is not None:
add_items.append(new)
if old is not None:
rem_items.append(old)
add_items += list(after_iter)
rem_items += list(before_iter)
return add_items, rem_items, keep_items

if __name__ == '__main__':
assert diff(, ) == (, , )
assert diff([1, 2, 3], [4, 3, 2]) == ([4], [1], [2, 3])
assert diff(['foo', 'bar'], ['baz', 'bar']) == (['baz'], ['foo'], ['bar'])


Main concerns:



  • The way I'm working with iterators seems a bit awkward and appears to unnecessarily bloat the code. I feel there must be a more elegant pattern, maybe without iterators at all?


  • The code should perform well on large lists. That's also why I'm not using multiple set operations instead. (add_items = set(after) - set(before), etc.) Or can I solve this efficiently with set operations? (I'm expecting both very small and very large differences between the iterables.)



Update: Unfortunately, I failed to include a key requirement into the question which therefore couldn't be considered in the answers. But I'll state it here, just for the record:



  • The algorithm needs to perform well when comparison between the items is expensive. Since an approach using set operations incorporates more comparisons than the algorithm above, it performs worse in my use case.








share|improve this question












share|improve this question




share|improve this question








edited May 15 at 19:07
























asked May 15 at 13:52









Arminius

1516




1516











  • If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
    – Alex Hall
    May 16 at 12:39
















  • If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
    – Alex Hall
    May 16 at 12:39















If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
– Alex Hall
May 16 at 12:39




If the problem is expensive comparisons, maybe reduce each object down to a small key first. For example, you could take the SHA256 of a large text document.
– Alex Hall
May 16 at 12:39










2 Answers
2






active

oldest

votes

















up vote
15
down vote



accepted










First, let's look at the naive implementation using sets:



def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)


This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?



Here are timings of your three example cases:



 Input diff diff_set diff_counter 
------------------------------ ------------------- ------------------- -------------------
, 1.72 µs ± 29.9 ns 1.21 µs ± 5.34 ns 9.91 µs ± 91.4 ns
[1,2,3], [4,3,2] 3.16 µs ± 183 ns 1.57 µs ± 28.4 ns 12 µs ± 180 ns
['foo','bar'], ['baz','bar'] 2.77 µs ± 83.4 ns 1.56 µs ± 37.7 ns 11.5 µs ± 240 ns


The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.



import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]


Here also the set implementation wins:



enter image description here



In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.




For reference, here the implementation of diff_counter I used:



from collections import Counter

def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]





share|improve this answer



















  • 2




    Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
    – Arminius
    May 15 at 14:36










  • I’m having a hard time accepting that you call code using sets the “naive implementation”, and I consequently don’t agree with your recommendation. Don’t try the “naïve” approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, …) approach. But don’t default to “naïve”. That’s premature pessimisation.
    – Konrad Rudolph
    May 15 at 15:16











  • @KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
    – Graipher
    May 15 at 15:34







  • 1




    As a drawback, I guess this works only with Hashable values.
    – Josay
    May 15 at 16:10






  • 1




    @Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
    – Graipher
    May 15 at 16:13

















up vote
6
down vote













With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.



But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:



new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,

try:
while True:
…


However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.



An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:



>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)


As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.



Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):



>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]


But kept elements doesn't have a nice way to retrieve them:



>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]





share|improve this answer

















  • 2




    Updated timings in my answer to include this function.
    – Graipher
    May 15 at 14:43










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f194459%2fdetermine-added-removed-and-kept-items-between-two-lists%23new-answer', 'question_page');

);

Post as a guest






























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
15
down vote



accepted










First, let's look at the naive implementation using sets:



def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)


This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?



Here are timings of your three example cases:



 Input diff diff_set diff_counter 
------------------------------ ------------------- ------------------- -------------------
, 1.72 µs ± 29.9 ns 1.21 µs ± 5.34 ns 9.91 µs ± 91.4 ns
[1,2,3], [4,3,2] 3.16 µs ± 183 ns 1.57 µs ± 28.4 ns 12 µs ± 180 ns
['foo','bar'], ['baz','bar'] 2.77 µs ± 83.4 ns 1.56 µs ± 37.7 ns 11.5 µs ± 240 ns


The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.



import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]


Here also the set implementation wins:



enter image description here



In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.




For reference, here the implementation of diff_counter I used:



from collections import Counter

def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]





share|improve this answer



















  • 2




    Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
    – Arminius
    May 15 at 14:36










  • I’m having a hard time accepting that you call code using sets the “naive implementation”, and I consequently don’t agree with your recommendation. Don’t try the “naïve” approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, …) approach. But don’t default to “naïve”. That’s premature pessimisation.
    – Konrad Rudolph
    May 15 at 15:16











  • @KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
    – Graipher
    May 15 at 15:34







  • 1




    As a drawback, I guess this works only with Hashable values.
    – Josay
    May 15 at 16:10






  • 1




    @Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
    – Graipher
    May 15 at 16:13














up vote
15
down vote



accepted










First, let's look at the naive implementation using sets:



def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)


This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?



Here are timings of your three example cases:



 Input diff diff_set diff_counter 
------------------------------ ------------------- ------------------- -------------------
, 1.72 µs ± 29.9 ns 1.21 µs ± 5.34 ns 9.91 µs ± 91.4 ns
[1,2,3], [4,3,2] 3.16 µs ± 183 ns 1.57 µs ± 28.4 ns 12 µs ± 180 ns
['foo','bar'], ['baz','bar'] 2.77 µs ± 83.4 ns 1.56 µs ± 37.7 ns 11.5 µs ± 240 ns


The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.



import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]


Here also the set implementation wins:



enter image description here



In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.




For reference, here the implementation of diff_counter I used:



from collections import Counter

def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]





share|improve this answer



















  • 2




    Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
    – Arminius
    May 15 at 14:36










  • I’m having a hard time accepting that you call code using sets the “naive implementation”, and I consequently don’t agree with your recommendation. Don’t try the “naïve” approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, …) approach. But don’t default to “naïve”. That’s premature pessimisation.
    – Konrad Rudolph
    May 15 at 15:16











  • @KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
    – Graipher
    May 15 at 15:34







  • 1




    As a drawback, I guess this works only with Hashable values.
    – Josay
    May 15 at 16:10






  • 1




    @Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
    – Graipher
    May 15 at 16:13












up vote
15
down vote



accepted







up vote
15
down vote



accepted






First, let's look at the naive implementation using sets:



def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)


This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?



Here are timings of your three example cases:



 Input diff diff_set diff_counter 
------------------------------ ------------------- ------------------- -------------------
, 1.72 µs ± 29.9 ns 1.21 µs ± 5.34 ns 9.91 µs ± 91.4 ns
[1,2,3], [4,3,2] 3.16 µs ± 183 ns 1.57 µs ± 28.4 ns 12 µs ± 180 ns
['foo','bar'], ['baz','bar'] 2.77 µs ± 83.4 ns 1.56 µs ± 37.7 ns 11.5 µs ± 240 ns


The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.



import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]


Here also the set implementation wins:



enter image description here



In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.




For reference, here the implementation of diff_counter I used:



from collections import Counter

def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]





share|improve this answer















First, let's look at the naive implementation using sets:



def diff_set(before, after):
b, a = set(before), set(after)
return list(a - b), list(b - a), list(a & b)


This is (so much more) readable and understandable than your code. The question is, is this significantly slower than your code?



Here are timings of your three example cases:



 Input diff diff_set diff_counter 
------------------------------ ------------------- ------------------- -------------------
, 1.72 µs ± 29.9 ns 1.21 µs ± 5.34 ns 9.91 µs ± 91.4 ns
[1,2,3], [4,3,2] 3.16 µs ± 183 ns 1.57 µs ± 28.4 ns 12 µs ± 180 ns
['foo','bar'], ['baz','bar'] 2.77 µs ± 83.4 ns 1.56 µs ± 37.7 ns 11.5 µs ± 240 ns


The set implementation is faster in all three cases. But what about the scaling behavior? First lets set up some random test data of increasing length. Note that I scaled the number of available values up with the size of the lists, so that set(before), set(after) does not become trivial and so that there are both added and removed values. The only way in which this might be unrealistic is that both before and after are always of the same size.



import random
random.seed(42)
random_data = [([random.randrange(2*n) for _ in range(n)], [random.randrange(2*n) for _ in range(n)]) for n in range(1, 10000, 100)]


Here also the set implementation wins:



enter image description here



In other words: always try the naive approach first and then test to see if the more complicated approach actually improves anything.




For reference, here the implementation of diff_counter I used:



from collections import Counter

def diff_counter(before, after):
c = Counter(after)
c.subtract(before)
return list(+c), list(-c), [elem for elem, count in c.items() if count == 0]






share|improve this answer















share|improve this answer



share|improve this answer








edited May 15 at 16:19


























answered May 15 at 14:25









Graipher

20.4k42981




20.4k42981







  • 2




    Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
    – Arminius
    May 15 at 14:36










  • I’m having a hard time accepting that you call code using sets the “naive implementation”, and I consequently don’t agree with your recommendation. Don’t try the “naïve” approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, …) approach. But don’t default to “naïve”. That’s premature pessimisation.
    – Konrad Rudolph
    May 15 at 15:16











  • @KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
    – Graipher
    May 15 at 15:34







  • 1




    As a drawback, I guess this works only with Hashable values.
    – Josay
    May 15 at 16:10






  • 1




    @Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
    – Graipher
    May 15 at 16:13












  • 2




    Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
    – Arminius
    May 15 at 14:36










  • I’m having a hard time accepting that you call code using sets the “naive implementation”, and I consequently don’t agree with your recommendation. Don’t try the “naïve” approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, …) approach. But don’t default to “naïve”. That’s premature pessimisation.
    – Konrad Rudolph
    May 15 at 15:16











  • @KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
    – Graipher
    May 15 at 15:34







  • 1




    As a drawback, I guess this works only with Hashable values.
    – Josay
    May 15 at 16:10






  • 1




    @Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
    – Graipher
    May 15 at 16:13







2




2




Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
– Arminius
May 15 at 14:36




Thanks for your analysis! I'm surprised because I (obviously) did try the naive approach first and concluded that it does not scale well. I'll see if I can follow that up with an explanation what was going on.
– Arminius
May 15 at 14:36












I’m having a hard time accepting that you call code using sets the “naive implementation”, and I consequently don’t agree with your recommendation. Don’t try the “naïve” approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, …) approach. But don’t default to “naïve”. That’s premature pessimisation.
– Konrad Rudolph
May 15 at 15:16





I’m having a hard time accepting that you call code using sets the “naive implementation”, and I consequently don’t agree with your recommendation. Don’t try the “naïve” approach first. Analyse the problem, get at least a fundamental understanding of the performance implications of various approaches, and see which approach might work. This may well be the simplest (shortest, …) approach. But don’t default to “naïve”. That’s premature pessimisation.
– Konrad Rudolph
May 15 at 15:16













@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
– Graipher
May 15 at 15:34





@KonradRudolph If you read my recommendation carefully you will discover that I only recommend implementing it in the straight-forward way first, then try to come up with a better way and measure the difference (which I did in this post). One could even argue that one should only try to think of a more clever way if the naive way is not fast enough (no premature optimizations, which here turned out to be actually worse).
– Graipher
May 15 at 15:34





1




1




As a drawback, I guess this works only with Hashable values.
– Josay
May 15 at 16:10




As a drawback, I guess this works only with Hashable values.
– Josay
May 15 at 16:10




1




1




@Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
– Graipher
May 15 at 16:13




@Josay: True. But since the OP already acknowledged that sets were an option for him and all their sample cases are hashable this will hopefully be a valid assumption. Also, the Counter way would have the same problem.
– Graipher
May 15 at 16:13












up vote
6
down vote













With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.



But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:



new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,

try:
while True:
…


However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.



An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:



>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)


As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.



Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):



>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]


But kept elements doesn't have a nice way to retrieve them:



>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]





share|improve this answer

















  • 2




    Updated timings in my answer to include this function.
    – Graipher
    May 15 at 14:43














up vote
6
down vote













With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.



But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:



new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,

try:
while True:
…


However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.



An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:



>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)


As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.



Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):



>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]


But kept elements doesn't have a nice way to retrieve them:



>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]





share|improve this answer

















  • 2




    Updated timings in my answer to include this function.
    – Graipher
    May 15 at 14:43












up vote
6
down vote










up vote
6
down vote









With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.



But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:



new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,

try:
while True:
…


However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.



An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:



>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)


As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.



Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):



>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]


But kept elements doesn't have a nice way to retrieve them:



>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]





share|improve this answer













With the current algorithm, the code seems as good as it can get. There is only the if new == old: if new is not None: check that bothered me a bit: a comment saying that this second check accounted for the initial pass in the loop would have made things clearer.



But as it only account for that pass alone, why check this condition each time? You could extract this special case out of the while to simplify the treatment:



new = next(after_iter, None)
old = next(before_iter, None)
if new is None and old is None:
return , ,
elif new is None:
return , before,
elif old is None:
return after, ,

try:
while True:
…


However, since sorted returns a list of the elements in the iterables, you end up consuming them entirely before processing them, hence making the use of iterator only usefull for the algorithm and not necessarily for its memory footprint.



An other approach, especially given the fact that there would be no duplicates, is to use a collections.Counter to do the whole work for you:



>>> before = [1, 2, 4, 5, 7]
>>> after = [3, 4, 5, 6, 2]
>>> c = Counter(after)
>>> c.subtract(before)
>>> c
Counter(3: 1, 6: 1, 4: 0, 5: 0, 2: 0, 1: -1, 7: -1)


As you can see, elements added have a count of 1, elements kept have a count of 0 and elements removed have a count of -1.



Retrieving added or removed elements is really easy (you can replace list with sorted if you want to keep your behaviour of returning sorted arrays):



>>> list(+c)
[3, 6]
>>> list(-c)
[1, 7]


But kept elements doesn't have a nice way to retrieve them:



>>> [elem for elem, count in c.items() if count == 0]
[4, 5, 2]






share|improve this answer













share|improve this answer



share|improve this answer











answered May 15 at 14:37









Mathias Ettinger

21.8k32875




21.8k32875







  • 2




    Updated timings in my answer to include this function.
    – Graipher
    May 15 at 14:43












  • 2




    Updated timings in my answer to include this function.
    – Graipher
    May 15 at 14:43







2




2




Updated timings in my answer to include this function.
– Graipher
May 15 at 14:43




Updated timings in my answer to include this function.
– Graipher
May 15 at 14:43












 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f194459%2fdetermine-added-removed-and-kept-items-between-two-lists%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

Python Lists

Aion

JavaScript Array Iteration Methods