Count how many times a value is greater than or equal to the median of previous D elements

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
7
down vote

favorite












I am self-studying algorithms and data structures. So, bear with me if the code I've written isn't stellar or interpretation is accurate.



Essentially, the problem (which I took from HackerRank) requires counting the number of times an element in an array is greater than or equal to two times the median of the subarray containing the past D elements.



For instance, suppose you have an array of [2,3,4,2,3,6,8,4,5]. And, the window is D = 5.



Starting value to compare against the median of the subarray is 6. So, you
look at the median of the subarray, [2,3,4,2,3], which is 3. Then, you multiply the median by 2, which is 6, and since the 2 x median value satisfies the condition, you would increment a counter by 1. You continue this procedure all the way to the last element of the array.



So, here's the code I have written thus far:



import sys

def activityNotifications(expenditure, d):
""" Worst-case: O( n^2 * logn ) """
num_of_trans = len(expenditure) - 1
count = 0
for i in range(d,num_of_trans):
prev_trans = sorted(expenditure[i-d:i])
if d % 2 == 0:
left_i = (d - 1) // 2
right_i = left_i + 1
median = prev_trans[left_i] + prev_trans[right_i]
else:
median = prev_trans[d - 1]
if expenditure[i] >= median:
count += 1
return count

if __name__ == "__main__":

expenditures = [[2,3,4,2,3,6,8,4,5],[1,2,3,4,4]]
d = 4

for record in expenditures:
result = activityNotifications(record, d)
print(result)


The approach works, but I am afraid that the code is poorly optimized. If my understanding of asymptotic complexity of this algorithm is correct, I believe it's $O(n ^ 2 * log(n))$. Is there any way I can reduce the complexity down to $O(n)$ or even better $O(log(n))$?



** Update (01/11/18) **



Based on Austin Hasting's suggestion, I revised my code to utilize bisect functions to avoid sorting at every new window. Essentially, the first window is sorted. Then, after the median of the sorted array is computed, bisect finds the position of the leftmost element of the window in the sorted array. Based on the position, the element within the sorted array is deleted. Finally, the next element is added onto the sorted array using bisect once again. Based on the procedure I have outlined, here's my new code:



def median(a, d):
if (d % 2) != 0:
return a[d // 2]
else:
half = d // 2
return (a[half] + a[half - 1]) / 2

def activityNotifications_bisect(expenditure, d):
""" Worst-case: O( n ) """

frauds = 0
window = sorted(expenditure[:d])
for p, i in enumerate(expenditure[d:]):
if i >= median(window, d) * 2:
frauds += 1
del window[bisect_left(window, expenditure[p])]
insort_left(window, i)
return frauds


Theoretically speaking, the code appears to be an improvement. However, when I tested the performance of my old code versus the new using timeit with 1 million executions, I was surprised to see that the old code completing at 5.35 seconds beats the new one by 2.1 seconds. I am perplexed about what's preventing my new code from performing better than the old one. Any additional tips would be incredibly helpful!







share|improve this question





















  • Is this written for Python 2 or 3?
    – Mast
    Jan 8 at 8:12










  • Can you add the link of the challenge, I can't seem to find it
    – Ludisposed
    Jan 8 at 13:15
















up vote
7
down vote

favorite












I am self-studying algorithms and data structures. So, bear with me if the code I've written isn't stellar or interpretation is accurate.



Essentially, the problem (which I took from HackerRank) requires counting the number of times an element in an array is greater than or equal to two times the median of the subarray containing the past D elements.



For instance, suppose you have an array of [2,3,4,2,3,6,8,4,5]. And, the window is D = 5.



Starting value to compare against the median of the subarray is 6. So, you
look at the median of the subarray, [2,3,4,2,3], which is 3. Then, you multiply the median by 2, which is 6, and since the 2 x median value satisfies the condition, you would increment a counter by 1. You continue this procedure all the way to the last element of the array.



So, here's the code I have written thus far:



import sys

def activityNotifications(expenditure, d):
""" Worst-case: O( n^2 * logn ) """
num_of_trans = len(expenditure) - 1
count = 0
for i in range(d,num_of_trans):
prev_trans = sorted(expenditure[i-d:i])
if d % 2 == 0:
left_i = (d - 1) // 2
right_i = left_i + 1
median = prev_trans[left_i] + prev_trans[right_i]
else:
median = prev_trans[d - 1]
if expenditure[i] >= median:
count += 1
return count

if __name__ == "__main__":

expenditures = [[2,3,4,2,3,6,8,4,5],[1,2,3,4,4]]
d = 4

for record in expenditures:
result = activityNotifications(record, d)
print(result)


The approach works, but I am afraid that the code is poorly optimized. If my understanding of asymptotic complexity of this algorithm is correct, I believe it's $O(n ^ 2 * log(n))$. Is there any way I can reduce the complexity down to $O(n)$ or even better $O(log(n))$?



** Update (01/11/18) **



Based on Austin Hasting's suggestion, I revised my code to utilize bisect functions to avoid sorting at every new window. Essentially, the first window is sorted. Then, after the median of the sorted array is computed, bisect finds the position of the leftmost element of the window in the sorted array. Based on the position, the element within the sorted array is deleted. Finally, the next element is added onto the sorted array using bisect once again. Based on the procedure I have outlined, here's my new code:



def median(a, d):
if (d % 2) != 0:
return a[d // 2]
else:
half = d // 2
return (a[half] + a[half - 1]) / 2

def activityNotifications_bisect(expenditure, d):
""" Worst-case: O( n ) """

frauds = 0
window = sorted(expenditure[:d])
for p, i in enumerate(expenditure[d:]):
if i >= median(window, d) * 2:
frauds += 1
del window[bisect_left(window, expenditure[p])]
insort_left(window, i)
return frauds


Theoretically speaking, the code appears to be an improvement. However, when I tested the performance of my old code versus the new using timeit with 1 million executions, I was surprised to see that the old code completing at 5.35 seconds beats the new one by 2.1 seconds. I am perplexed about what's preventing my new code from performing better than the old one. Any additional tips would be incredibly helpful!







share|improve this question





















  • Is this written for Python 2 or 3?
    – Mast
    Jan 8 at 8:12










  • Can you add the link of the challenge, I can't seem to find it
    – Ludisposed
    Jan 8 at 13:15












up vote
7
down vote

favorite









up vote
7
down vote

favorite











I am self-studying algorithms and data structures. So, bear with me if the code I've written isn't stellar or interpretation is accurate.



Essentially, the problem (which I took from HackerRank) requires counting the number of times an element in an array is greater than or equal to two times the median of the subarray containing the past D elements.



For instance, suppose you have an array of [2,3,4,2,3,6,8,4,5]. And, the window is D = 5.



Starting value to compare against the median of the subarray is 6. So, you
look at the median of the subarray, [2,3,4,2,3], which is 3. Then, you multiply the median by 2, which is 6, and since the 2 x median value satisfies the condition, you would increment a counter by 1. You continue this procedure all the way to the last element of the array.



So, here's the code I have written thus far:



import sys

def activityNotifications(expenditure, d):
""" Worst-case: O( n^2 * logn ) """
num_of_trans = len(expenditure) - 1
count = 0
for i in range(d,num_of_trans):
prev_trans = sorted(expenditure[i-d:i])
if d % 2 == 0:
left_i = (d - 1) // 2
right_i = left_i + 1
median = prev_trans[left_i] + prev_trans[right_i]
else:
median = prev_trans[d - 1]
if expenditure[i] >= median:
count += 1
return count

if __name__ == "__main__":

expenditures = [[2,3,4,2,3,6,8,4,5],[1,2,3,4,4]]
d = 4

for record in expenditures:
result = activityNotifications(record, d)
print(result)


The approach works, but I am afraid that the code is poorly optimized. If my understanding of asymptotic complexity of this algorithm is correct, I believe it's $O(n ^ 2 * log(n))$. Is there any way I can reduce the complexity down to $O(n)$ or even better $O(log(n))$?



** Update (01/11/18) **



Based on Austin Hasting's suggestion, I revised my code to utilize bisect functions to avoid sorting at every new window. Essentially, the first window is sorted. Then, after the median of the sorted array is computed, bisect finds the position of the leftmost element of the window in the sorted array. Based on the position, the element within the sorted array is deleted. Finally, the next element is added onto the sorted array using bisect once again. Based on the procedure I have outlined, here's my new code:



def median(a, d):
if (d % 2) != 0:
return a[d // 2]
else:
half = d // 2
return (a[half] + a[half - 1]) / 2

def activityNotifications_bisect(expenditure, d):
""" Worst-case: O( n ) """

frauds = 0
window = sorted(expenditure[:d])
for p, i in enumerate(expenditure[d:]):
if i >= median(window, d) * 2:
frauds += 1
del window[bisect_left(window, expenditure[p])]
insort_left(window, i)
return frauds


Theoretically speaking, the code appears to be an improvement. However, when I tested the performance of my old code versus the new using timeit with 1 million executions, I was surprised to see that the old code completing at 5.35 seconds beats the new one by 2.1 seconds. I am perplexed about what's preventing my new code from performing better than the old one. Any additional tips would be incredibly helpful!







share|improve this question













I am self-studying algorithms and data structures. So, bear with me if the code I've written isn't stellar or interpretation is accurate.



Essentially, the problem (which I took from HackerRank) requires counting the number of times an element in an array is greater than or equal to two times the median of the subarray containing the past D elements.



For instance, suppose you have an array of [2,3,4,2,3,6,8,4,5]. And, the window is D = 5.



Starting value to compare against the median of the subarray is 6. So, you
look at the median of the subarray, [2,3,4,2,3], which is 3. Then, you multiply the median by 2, which is 6, and since the 2 x median value satisfies the condition, you would increment a counter by 1. You continue this procedure all the way to the last element of the array.



So, here's the code I have written thus far:



import sys

def activityNotifications(expenditure, d):
""" Worst-case: O( n^2 * logn ) """
num_of_trans = len(expenditure) - 1
count = 0
for i in range(d,num_of_trans):
prev_trans = sorted(expenditure[i-d:i])
if d % 2 == 0:
left_i = (d - 1) // 2
right_i = left_i + 1
median = prev_trans[left_i] + prev_trans[right_i]
else:
median = prev_trans[d - 1]
if expenditure[i] >= median:
count += 1
return count

if __name__ == "__main__":

expenditures = [[2,3,4,2,3,6,8,4,5],[1,2,3,4,4]]
d = 4

for record in expenditures:
result = activityNotifications(record, d)
print(result)


The approach works, but I am afraid that the code is poorly optimized. If my understanding of asymptotic complexity of this algorithm is correct, I believe it's $O(n ^ 2 * log(n))$. Is there any way I can reduce the complexity down to $O(n)$ or even better $O(log(n))$?



** Update (01/11/18) **



Based on Austin Hasting's suggestion, I revised my code to utilize bisect functions to avoid sorting at every new window. Essentially, the first window is sorted. Then, after the median of the sorted array is computed, bisect finds the position of the leftmost element of the window in the sorted array. Based on the position, the element within the sorted array is deleted. Finally, the next element is added onto the sorted array using bisect once again. Based on the procedure I have outlined, here's my new code:



def median(a, d):
if (d % 2) != 0:
return a[d // 2]
else:
half = d // 2
return (a[half] + a[half - 1]) / 2

def activityNotifications_bisect(expenditure, d):
""" Worst-case: O( n ) """

frauds = 0
window = sorted(expenditure[:d])
for p, i in enumerate(expenditure[d:]):
if i >= median(window, d) * 2:
frauds += 1
del window[bisect_left(window, expenditure[p])]
insort_left(window, i)
return frauds


Theoretically speaking, the code appears to be an improvement. However, when I tested the performance of my old code versus the new using timeit with 1 million executions, I was surprised to see that the old code completing at 5.35 seconds beats the new one by 2.1 seconds. I am perplexed about what's preventing my new code from performing better than the old one. Any additional tips would be incredibly helpful!









share|improve this question












share|improve this question




share|improve this question








edited Jan 11 at 19:21
























asked Jan 8 at 4:18









daniel lee

485




485











  • Is this written for Python 2 or 3?
    – Mast
    Jan 8 at 8:12










  • Can you add the link of the challenge, I can't seem to find it
    – Ludisposed
    Jan 8 at 13:15
















  • Is this written for Python 2 or 3?
    – Mast
    Jan 8 at 8:12










  • Can you add the link of the challenge, I can't seem to find it
    – Ludisposed
    Jan 8 at 13:15















Is this written for Python 2 or 3?
– Mast
Jan 8 at 8:12




Is this written for Python 2 or 3?
– Mast
Jan 8 at 8:12












Can you add the link of the challenge, I can't seem to find it
– Ludisposed
Jan 8 at 13:15




Can you add the link of the challenge, I can't seem to find it
– Ludisposed
Jan 8 at 13:15










2 Answers
2






active

oldest

votes

















up vote
4
down vote



accepted










You are almost correct.



I believe you are interpreting the performance as $O(n²*logn)$ because you are looping over (most of) the items in the input array (length = n) and you are using a sort to compute your median.



However, sorted's performance is actually determined by the size of the window parameter, D, which is a constant, and its log is a constant, too! So your performance is technically $O(n*DlogD)$, which is $O(n × C)$ which is $O(n)$.



But you can do better!



Assume for a moment that you have the window sorted. You now receive a sequence of pairs (a, z) where value a is to be removed from the window and value z inserted. Once this is done, you will need to recompute the median.



Also, you have two cases: one where the window size D is an odd number and thus the median is one of the values in the window; and the other case where the window size is an even number and the median must be computed as the mean of the central values.



I think that if you separate your code into two functions at the top level - one for even windows, one for odd windows, you can write code that updates the sorted window without having to invoke the sorted function each time through. Thus, instead of D log D you can perform one removal and one insert in a single pass, having time proportional to D.



You'll still pass over nearly all the items in the list, so your time will still be $O(n)$, but it won't be times quite so large a constant.



Edit for @hjpotter92:



You are traversing an unsorted sequence of numbers, using a sliding window:



 D = 4
[a, b, c, d,] e, f, g, h
a, [b, c, d, e,] f, g, h
a, b, [c, d, e, f,] g, h


Each time you advance, your window adds one number (I called it z) on the right side, and loses one number (I called it a) from the left side. So instead of pretending that each window is a whole new bunch of numbers that have to be put into order, using sort, you can write your code to remove one number, insert one number, and then recompute the median.



In some cases (D is odd), the median is just window[D//2]. In other cases (D is even), the median is (window[D//2-1] + window[D//2]) / 2. This is because of how the median is computed. You can see OP doing this computation in the original code. However, there's no reason to keep checking D for odd/even, since it never changes in the function. Instead, it's better to split the function at the top level, and proceed from there:



def func(sequence, d):
if d % 2 == 0: # D is even
# code for scanning with even window size
else: # D is odd
# code for scanning with odd window size


Edit 2: Comprehensive performance test



I made some changes to your code to test various performance enhancements. Here is one set of results run on my 2011 MacBook Pro (numbers are seconds, lower is better):



$ python -V
Python 3.5.4

$ python test.py

With n=10 inputs, d=4
replsort 0.925
replsort2 1.063
linsert 1.549
insbisect 1.102
noparity 0.965
orig 1.129

With n=1000 inputs, d=4
replsort 3.945
replsort2 4.268
linsert 5.210
insbisect 4.280
noparity 4.431
orig 5.324

With n=1000 inputs, d=50
replsort 7.384
replsort2 7.156
linsert 8.763
insbisect 4.736
noparity 10.172
orig 10.854

With n=1000 inputs, d=500
replsort 10.447
replsort2 10.590
linsert 8.184
insbisect 6.641
noparity 50.901
orig 49.458


The orig function is your code. The noparity function is your code with the parity check hoisted to the top. The replsort functions maintain the window by replacing the a value with the z value (as discussed above) and then sorting the window. The linsert and insbisect functions work by doing a linear scan of the window, or by calling bisect.insort_left, respectively, to get the z value in place.



Important: your example is bad. You've got a short list of input numbers, with a small window size. That means the "fixed costs" will tend to dominate your performance.



What does that mean? It means that things like calling the function, initializing count = 0, setting up and tearing down the for loop, and executing the return statement will make a difference.



It also means that the difference between $O(n)$ and $O(log n)$ and $O(n log n)$ don't seem to matter.



But when I switched to generating 1000 random numbers, instead of 10 fixed numbers, things changed! Notice the linsert function goes from being 50% slower, to being something like 25% slower when n=1000. Also, notice that all the other functions beat noparity. I believe this is because noparity always recomputes the entire slice before sorting it. That computation got added in 990 more times (for d=4) and it started to be a drag on the performance.



Finally, I started increasing the window size. This is where the magic really happens, because this affects the performance of the sort, the bisect, the window search and removes. Essentially, this is the number that really determines the differences in performance. For a small window size, again the upfront costs of calling sorted or remove or bisect are dominating. But when the window size goes up, suddenly the differences in algorithmic efficiency start to matter.



And look what happens with d=500. First, realize that setting d=500 means the code only does half the work! Because the input of 1000 minus a 500 window size leaves 500 checks to run, rather than 996 (d=4) or 950 (d=50). But when that change is made, suddenly linsert (which is $O(d)$) and insbisect (which is $O(log d)$) outperform everybody. Because "setup costs" and "written in C" no longer cover up $O(d log d)$!



# test.py
import bisect

def orig(expenditure, d):
""" Worst-case: O( n^2 * logn ) """
num_of_trans = len(expenditure) - 1
count = 0
for i in range(d, num_of_trans):
prev_trans = sorted(expenditure[i - d:i])
if d % 2 == 0:
left_i = (d - 1) // 2
right_i = left_i + 1
median = prev_trans[left_i] + prev_trans[right_i]
else:
median = prev_trans[d - 1]
if expenditure[i] >= median:
count += 1

return count

def noparity(expenditure, d):
"""Hoist parity check out of loop"""

num_of_trans = len(expenditure) - 1
count = 0
left_i = (d - 1) // 2
right_i = left_i + 1

if d % 2 == 0:
for i in range(d, num_of_trans):
prev_trans = sorted(expenditure[i - d:i])
median_x2 = prev_trans[left_i] + prev_trans[right_i]
if expenditure[i] >= median_x2:
count += 1
else:
for i in range(d, num_of_trans):
prev_trans = sorted(expenditure[i - d:i])
median_x2 = prev_trans[d // 2] * 2
if expenditure[i] >= median_x2:
count += 1

return count

def linsert(expenditure, d):
"""Hoisted median, using linear insertion instead of sort"""

count = 0
len_exp = len(expenditure)
window = sorted(expenditure[:d])
winremove = window.remove
winsert = window.insert
winappend = window.append

if d % 2 == 0:
m1 = d // 2
m0 = m1 - 1
ai = 0
for zi in range(d, len_exp):
median_x2 = window[m0] + window[m1]
a, z = expenditure[ai], expenditure[zi]
ai += 1

if z >= median_x2:
count += 1

winremove(a)
for i in range(d-1): # d-1 because .remove
if window[i] <= z:
winsert(i, z)
break
else:
winappend(z)

else:
raise NotImplementedError("I haven't written this code.")

return count

def insbisect(expenditure, d):
"""Hoisted median, using bisect for insertion."""

count = 0
len_exp = len(expenditure)
window = sorted(expenditure[:d])
winremove = window.remove

if d % 2 == 0:
m1 = d // 2
m0 = m1 - 1
ai = 0
for zi in range(d, len_exp):
median_x2 = window[m0] + window[m1]
a, z = expenditure[ai], expenditure[zi]
ai += 1

if z >= median_x2:
count += 1

winremove(a)
bisect.insort_left(window, z)

else:
raise NotImplementedError("I haven't written this code.")

return count

def replsort(expenditure, d):
"""Hoisted median, using replacement via index, then sort."""

count = 0

window = sorted(expenditure[:d])
windex = window.index
winsort = window.sort

if d % 2 == 0:
m1 = d // 2
m0 = m1 - 1

for a, z in zip(expenditure, expenditure[d:]):
median_x2 = window[m0] + window[m1]
if z >= median_x2:
count += 1

# Note: do not write `window =` anywhere, or you invalidate
# windex and winsort.
window[windex(a)] = z
winsort()
else:
raise NotImplementedError("I haven't written this code.")

return count

def replsort2(expenditure, d):
"""Hoisted median, using replacement via index, then sort."""

count = 0

len_exp = len(expenditure)
window = sorted(expenditure[:d])
windex = window.index
winsort = window.sort

if d % 2 == 0:

m1 = d // 2
m0 = m1 - 1
ai = 0

for zi in range(d, len_exp):
median_x2 = window[m0] + window[m1]
a, z = expenditure[ai], expenditure[zi]
ai += 1

if z >= median_x2:
count += 1

# Note: do not write `window =` anywhere, or you invalidate
# windex and winsort.
window[windex(a)] = z
winsort()
else:
raise NotImplementedError("I haven't written this code.")

return count

if __name__ == "__main__":
from timeit import timeit
from random import randrange as rr

results =[line.strip().split() for line in """
replsort 9.730
replsort2 11.538
linsert 17.954
insbisect 10.970
noparity 10.574
orig 11.801
""".splitlines() if line.strip()]

print("nWith n=10 inputs, d=4")
for funcname, time_for_me in results:
print(" :20s :5.3f".format(funcname,
timeit('([2,3,4,2,3,6,8,4,5], 4)'.format(funcname),
setup='from __main__ import '.format(funcname),
number=100000)))

print("nWith n=1000 inputs, d=4")

for funcname, time_for_me in results:
print(" :20s :5.3f".format(funcname,
timeit('([rr(1, 16) for _ in range(1000)], 4)'.format(funcname),
setup='from __main__ import rr, '.format(funcname),
number=1000)))


print("nWith n=1000 inputs, d=50")

for funcname, time_for_me in results:
print(" :20s :5.3f".format(funcname,
timeit('([rr(1, 16) for _ in range(1000)], 50)'.format(funcname),
setup='from __main__ import rr, '.format(funcname),
number=1000)))





share|improve this answer























  • Can you explain in a bit more detail what you mean by the a, z pair; the window and computation of median?
    – hjpotter92
    Jan 8 at 9:38










  • @Austin Hastings, based on your suggestion, I have made some revision to my code. Please let me know what you think. I am not getting the performance boost as I expected.
    – daniel lee
    Jan 11 at 19:23






  • 1




    First off, your median computation is still checking the parity of the window during each call. My suggestion was for you to do a check at the top of your main function (activityNotifications_bisect now) one time, and then dispatch to code that would compute the median correctly. I don't think you need a separate median function at all, if you want performance.
    – Austin Hastings
    Jan 11 at 19:28






  • 1




    Updated again. Note the effect of increasing d...
    – Austin Hastings
    Jan 11 at 21:54






  • 1




    Note that you might be able to squeeze some more performance out of insbisect using bisect.bisect_left and del instead of window.remove. I would expect it to win at d=500, not sure about 50, probably not at 4.
    – Austin Hastings
    Jan 11 at 21:59

















up vote
0
down vote














2X greater than or equal to the median




A more precise formulation would be "greater than or equal to twice the median"




I believe it's O(n2∗log(n)).




There is no n given in this problem. The n in O(n) isn't just some dummy variable; it refers to an actual parameter in the problem space. Saying "the complexity is O(n)", when you haven't said what n is, is meaningless.



Every time you move the window, all but one element is already sorted. You can do a binary search to find where to insert the new element, giving O(log(D)) complexity. For D = 4, that won't save much time and probably isn't worth the overhead, but for large D it would be a significant savings.






share|improve this answer





















    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "196"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );








     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f184553%2fcount-how-many-times-a-value-is-greater-than-or-equal-to-the-median-of-previous%23new-answer', 'question_page');

    );

    Post as a guest






























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    4
    down vote



    accepted










    You are almost correct.



    I believe you are interpreting the performance as $O(n²*logn)$ because you are looping over (most of) the items in the input array (length = n) and you are using a sort to compute your median.



    However, sorted's performance is actually determined by the size of the window parameter, D, which is a constant, and its log is a constant, too! So your performance is technically $O(n*DlogD)$, which is $O(n × C)$ which is $O(n)$.



    But you can do better!



    Assume for a moment that you have the window sorted. You now receive a sequence of pairs (a, z) where value a is to be removed from the window and value z inserted. Once this is done, you will need to recompute the median.



    Also, you have two cases: one where the window size D is an odd number and thus the median is one of the values in the window; and the other case where the window size is an even number and the median must be computed as the mean of the central values.



    I think that if you separate your code into two functions at the top level - one for even windows, one for odd windows, you can write code that updates the sorted window without having to invoke the sorted function each time through. Thus, instead of D log D you can perform one removal and one insert in a single pass, having time proportional to D.



    You'll still pass over nearly all the items in the list, so your time will still be $O(n)$, but it won't be times quite so large a constant.



    Edit for @hjpotter92:



    You are traversing an unsorted sequence of numbers, using a sliding window:



     D = 4
    [a, b, c, d,] e, f, g, h
    a, [b, c, d, e,] f, g, h
    a, b, [c, d, e, f,] g, h


    Each time you advance, your window adds one number (I called it z) on the right side, and loses one number (I called it a) from the left side. So instead of pretending that each window is a whole new bunch of numbers that have to be put into order, using sort, you can write your code to remove one number, insert one number, and then recompute the median.



    In some cases (D is odd), the median is just window[D//2]. In other cases (D is even), the median is (window[D//2-1] + window[D//2]) / 2. This is because of how the median is computed. You can see OP doing this computation in the original code. However, there's no reason to keep checking D for odd/even, since it never changes in the function. Instead, it's better to split the function at the top level, and proceed from there:



    def func(sequence, d):
    if d % 2 == 0: # D is even
    # code for scanning with even window size
    else: # D is odd
    # code for scanning with odd window size


    Edit 2: Comprehensive performance test



    I made some changes to your code to test various performance enhancements. Here is one set of results run on my 2011 MacBook Pro (numbers are seconds, lower is better):



    $ python -V
    Python 3.5.4

    $ python test.py

    With n=10 inputs, d=4
    replsort 0.925
    replsort2 1.063
    linsert 1.549
    insbisect 1.102
    noparity 0.965
    orig 1.129

    With n=1000 inputs, d=4
    replsort 3.945
    replsort2 4.268
    linsert 5.210
    insbisect 4.280
    noparity 4.431
    orig 5.324

    With n=1000 inputs, d=50
    replsort 7.384
    replsort2 7.156
    linsert 8.763
    insbisect 4.736
    noparity 10.172
    orig 10.854

    With n=1000 inputs, d=500
    replsort 10.447
    replsort2 10.590
    linsert 8.184
    insbisect 6.641
    noparity 50.901
    orig 49.458


    The orig function is your code. The noparity function is your code with the parity check hoisted to the top. The replsort functions maintain the window by replacing the a value with the z value (as discussed above) and then sorting the window. The linsert and insbisect functions work by doing a linear scan of the window, or by calling bisect.insort_left, respectively, to get the z value in place.



    Important: your example is bad. You've got a short list of input numbers, with a small window size. That means the "fixed costs" will tend to dominate your performance.



    What does that mean? It means that things like calling the function, initializing count = 0, setting up and tearing down the for loop, and executing the return statement will make a difference.



    It also means that the difference between $O(n)$ and $O(log n)$ and $O(n log n)$ don't seem to matter.



    But when I switched to generating 1000 random numbers, instead of 10 fixed numbers, things changed! Notice the linsert function goes from being 50% slower, to being something like 25% slower when n=1000. Also, notice that all the other functions beat noparity. I believe this is because noparity always recomputes the entire slice before sorting it. That computation got added in 990 more times (for d=4) and it started to be a drag on the performance.



    Finally, I started increasing the window size. This is where the magic really happens, because this affects the performance of the sort, the bisect, the window search and removes. Essentially, this is the number that really determines the differences in performance. For a small window size, again the upfront costs of calling sorted or remove or bisect are dominating. But when the window size goes up, suddenly the differences in algorithmic efficiency start to matter.



    And look what happens with d=500. First, realize that setting d=500 means the code only does half the work! Because the input of 1000 minus a 500 window size leaves 500 checks to run, rather than 996 (d=4) or 950 (d=50). But when that change is made, suddenly linsert (which is $O(d)$) and insbisect (which is $O(log d)$) outperform everybody. Because "setup costs" and "written in C" no longer cover up $O(d log d)$!



    # test.py
    import bisect

    def orig(expenditure, d):
    """ Worst-case: O( n^2 * logn ) """
    num_of_trans = len(expenditure) - 1
    count = 0
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    if d % 2 == 0:
    left_i = (d - 1) // 2
    right_i = left_i + 1
    median = prev_trans[left_i] + prev_trans[right_i]
    else:
    median = prev_trans[d - 1]
    if expenditure[i] >= median:
    count += 1

    return count

    def noparity(expenditure, d):
    """Hoist parity check out of loop"""

    num_of_trans = len(expenditure) - 1
    count = 0
    left_i = (d - 1) // 2
    right_i = left_i + 1

    if d % 2 == 0:
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    median_x2 = prev_trans[left_i] + prev_trans[right_i]
    if expenditure[i] >= median_x2:
    count += 1
    else:
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    median_x2 = prev_trans[d // 2] * 2
    if expenditure[i] >= median_x2:
    count += 1

    return count

    def linsert(expenditure, d):
    """Hoisted median, using linear insertion instead of sort"""

    count = 0
    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    winremove = window.remove
    winsert = window.insert
    winappend = window.append

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1
    ai = 0
    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    winremove(a)
    for i in range(d-1): # d-1 because .remove
    if window[i] <= z:
    winsert(i, z)
    break
    else:
    winappend(z)

    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def insbisect(expenditure, d):
    """Hoisted median, using bisect for insertion."""

    count = 0
    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    winremove = window.remove

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1
    ai = 0
    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    winremove(a)
    bisect.insort_left(window, z)

    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def replsort(expenditure, d):
    """Hoisted median, using replacement via index, then sort."""

    count = 0

    window = sorted(expenditure[:d])
    windex = window.index
    winsort = window.sort

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1

    for a, z in zip(expenditure, expenditure[d:]):
    median_x2 = window[m0] + window[m1]
    if z >= median_x2:
    count += 1

    # Note: do not write `window =` anywhere, or you invalidate
    # windex and winsort.
    window[windex(a)] = z
    winsort()
    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def replsort2(expenditure, d):
    """Hoisted median, using replacement via index, then sort."""

    count = 0

    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    windex = window.index
    winsort = window.sort

    if d % 2 == 0:

    m1 = d // 2
    m0 = m1 - 1
    ai = 0

    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    # Note: do not write `window =` anywhere, or you invalidate
    # windex and winsort.
    window[windex(a)] = z
    winsort()
    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    if __name__ == "__main__":
    from timeit import timeit
    from random import randrange as rr

    results =[line.strip().split() for line in """
    replsort 9.730
    replsort2 11.538
    linsert 17.954
    insbisect 10.970
    noparity 10.574
    orig 11.801
    """.splitlines() if line.strip()]

    print("nWith n=10 inputs, d=4")
    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([2,3,4,2,3,6,8,4,5], 4)'.format(funcname),
    setup='from __main__ import '.format(funcname),
    number=100000)))

    print("nWith n=1000 inputs, d=4")

    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([rr(1, 16) for _ in range(1000)], 4)'.format(funcname),
    setup='from __main__ import rr, '.format(funcname),
    number=1000)))


    print("nWith n=1000 inputs, d=50")

    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([rr(1, 16) for _ in range(1000)], 50)'.format(funcname),
    setup='from __main__ import rr, '.format(funcname),
    number=1000)))





    share|improve this answer























    • Can you explain in a bit more detail what you mean by the a, z pair; the window and computation of median?
      – hjpotter92
      Jan 8 at 9:38










    • @Austin Hastings, based on your suggestion, I have made some revision to my code. Please let me know what you think. I am not getting the performance boost as I expected.
      – daniel lee
      Jan 11 at 19:23






    • 1




      First off, your median computation is still checking the parity of the window during each call. My suggestion was for you to do a check at the top of your main function (activityNotifications_bisect now) one time, and then dispatch to code that would compute the median correctly. I don't think you need a separate median function at all, if you want performance.
      – Austin Hastings
      Jan 11 at 19:28






    • 1




      Updated again. Note the effect of increasing d...
      – Austin Hastings
      Jan 11 at 21:54






    • 1




      Note that you might be able to squeeze some more performance out of insbisect using bisect.bisect_left and del instead of window.remove. I would expect it to win at d=500, not sure about 50, probably not at 4.
      – Austin Hastings
      Jan 11 at 21:59














    up vote
    4
    down vote



    accepted










    You are almost correct.



    I believe you are interpreting the performance as $O(n²*logn)$ because you are looping over (most of) the items in the input array (length = n) and you are using a sort to compute your median.



    However, sorted's performance is actually determined by the size of the window parameter, D, which is a constant, and its log is a constant, too! So your performance is technically $O(n*DlogD)$, which is $O(n × C)$ which is $O(n)$.



    But you can do better!



    Assume for a moment that you have the window sorted. You now receive a sequence of pairs (a, z) where value a is to be removed from the window and value z inserted. Once this is done, you will need to recompute the median.



    Also, you have two cases: one where the window size D is an odd number and thus the median is one of the values in the window; and the other case where the window size is an even number and the median must be computed as the mean of the central values.



    I think that if you separate your code into two functions at the top level - one for even windows, one for odd windows, you can write code that updates the sorted window without having to invoke the sorted function each time through. Thus, instead of D log D you can perform one removal and one insert in a single pass, having time proportional to D.



    You'll still pass over nearly all the items in the list, so your time will still be $O(n)$, but it won't be times quite so large a constant.



    Edit for @hjpotter92:



    You are traversing an unsorted sequence of numbers, using a sliding window:



     D = 4
    [a, b, c, d,] e, f, g, h
    a, [b, c, d, e,] f, g, h
    a, b, [c, d, e, f,] g, h


    Each time you advance, your window adds one number (I called it z) on the right side, and loses one number (I called it a) from the left side. So instead of pretending that each window is a whole new bunch of numbers that have to be put into order, using sort, you can write your code to remove one number, insert one number, and then recompute the median.



    In some cases (D is odd), the median is just window[D//2]. In other cases (D is even), the median is (window[D//2-1] + window[D//2]) / 2. This is because of how the median is computed. You can see OP doing this computation in the original code. However, there's no reason to keep checking D for odd/even, since it never changes in the function. Instead, it's better to split the function at the top level, and proceed from there:



    def func(sequence, d):
    if d % 2 == 0: # D is even
    # code for scanning with even window size
    else: # D is odd
    # code for scanning with odd window size


    Edit 2: Comprehensive performance test



    I made some changes to your code to test various performance enhancements. Here is one set of results run on my 2011 MacBook Pro (numbers are seconds, lower is better):



    $ python -V
    Python 3.5.4

    $ python test.py

    With n=10 inputs, d=4
    replsort 0.925
    replsort2 1.063
    linsert 1.549
    insbisect 1.102
    noparity 0.965
    orig 1.129

    With n=1000 inputs, d=4
    replsort 3.945
    replsort2 4.268
    linsert 5.210
    insbisect 4.280
    noparity 4.431
    orig 5.324

    With n=1000 inputs, d=50
    replsort 7.384
    replsort2 7.156
    linsert 8.763
    insbisect 4.736
    noparity 10.172
    orig 10.854

    With n=1000 inputs, d=500
    replsort 10.447
    replsort2 10.590
    linsert 8.184
    insbisect 6.641
    noparity 50.901
    orig 49.458


    The orig function is your code. The noparity function is your code with the parity check hoisted to the top. The replsort functions maintain the window by replacing the a value with the z value (as discussed above) and then sorting the window. The linsert and insbisect functions work by doing a linear scan of the window, or by calling bisect.insort_left, respectively, to get the z value in place.



    Important: your example is bad. You've got a short list of input numbers, with a small window size. That means the "fixed costs" will tend to dominate your performance.



    What does that mean? It means that things like calling the function, initializing count = 0, setting up and tearing down the for loop, and executing the return statement will make a difference.



    It also means that the difference between $O(n)$ and $O(log n)$ and $O(n log n)$ don't seem to matter.



    But when I switched to generating 1000 random numbers, instead of 10 fixed numbers, things changed! Notice the linsert function goes from being 50% slower, to being something like 25% slower when n=1000. Also, notice that all the other functions beat noparity. I believe this is because noparity always recomputes the entire slice before sorting it. That computation got added in 990 more times (for d=4) and it started to be a drag on the performance.



    Finally, I started increasing the window size. This is where the magic really happens, because this affects the performance of the sort, the bisect, the window search and removes. Essentially, this is the number that really determines the differences in performance. For a small window size, again the upfront costs of calling sorted or remove or bisect are dominating. But when the window size goes up, suddenly the differences in algorithmic efficiency start to matter.



    And look what happens with d=500. First, realize that setting d=500 means the code only does half the work! Because the input of 1000 minus a 500 window size leaves 500 checks to run, rather than 996 (d=4) or 950 (d=50). But when that change is made, suddenly linsert (which is $O(d)$) and insbisect (which is $O(log d)$) outperform everybody. Because "setup costs" and "written in C" no longer cover up $O(d log d)$!



    # test.py
    import bisect

    def orig(expenditure, d):
    """ Worst-case: O( n^2 * logn ) """
    num_of_trans = len(expenditure) - 1
    count = 0
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    if d % 2 == 0:
    left_i = (d - 1) // 2
    right_i = left_i + 1
    median = prev_trans[left_i] + prev_trans[right_i]
    else:
    median = prev_trans[d - 1]
    if expenditure[i] >= median:
    count += 1

    return count

    def noparity(expenditure, d):
    """Hoist parity check out of loop"""

    num_of_trans = len(expenditure) - 1
    count = 0
    left_i = (d - 1) // 2
    right_i = left_i + 1

    if d % 2 == 0:
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    median_x2 = prev_trans[left_i] + prev_trans[right_i]
    if expenditure[i] >= median_x2:
    count += 1
    else:
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    median_x2 = prev_trans[d // 2] * 2
    if expenditure[i] >= median_x2:
    count += 1

    return count

    def linsert(expenditure, d):
    """Hoisted median, using linear insertion instead of sort"""

    count = 0
    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    winremove = window.remove
    winsert = window.insert
    winappend = window.append

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1
    ai = 0
    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    winremove(a)
    for i in range(d-1): # d-1 because .remove
    if window[i] <= z:
    winsert(i, z)
    break
    else:
    winappend(z)

    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def insbisect(expenditure, d):
    """Hoisted median, using bisect for insertion."""

    count = 0
    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    winremove = window.remove

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1
    ai = 0
    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    winremove(a)
    bisect.insort_left(window, z)

    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def replsort(expenditure, d):
    """Hoisted median, using replacement via index, then sort."""

    count = 0

    window = sorted(expenditure[:d])
    windex = window.index
    winsort = window.sort

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1

    for a, z in zip(expenditure, expenditure[d:]):
    median_x2 = window[m0] + window[m1]
    if z >= median_x2:
    count += 1

    # Note: do not write `window =` anywhere, or you invalidate
    # windex and winsort.
    window[windex(a)] = z
    winsort()
    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def replsort2(expenditure, d):
    """Hoisted median, using replacement via index, then sort."""

    count = 0

    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    windex = window.index
    winsort = window.sort

    if d % 2 == 0:

    m1 = d // 2
    m0 = m1 - 1
    ai = 0

    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    # Note: do not write `window =` anywhere, or you invalidate
    # windex and winsort.
    window[windex(a)] = z
    winsort()
    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    if __name__ == "__main__":
    from timeit import timeit
    from random import randrange as rr

    results =[line.strip().split() for line in """
    replsort 9.730
    replsort2 11.538
    linsert 17.954
    insbisect 10.970
    noparity 10.574
    orig 11.801
    """.splitlines() if line.strip()]

    print("nWith n=10 inputs, d=4")
    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([2,3,4,2,3,6,8,4,5], 4)'.format(funcname),
    setup='from __main__ import '.format(funcname),
    number=100000)))

    print("nWith n=1000 inputs, d=4")

    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([rr(1, 16) for _ in range(1000)], 4)'.format(funcname),
    setup='from __main__ import rr, '.format(funcname),
    number=1000)))


    print("nWith n=1000 inputs, d=50")

    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([rr(1, 16) for _ in range(1000)], 50)'.format(funcname),
    setup='from __main__ import rr, '.format(funcname),
    number=1000)))





    share|improve this answer























    • Can you explain in a bit more detail what you mean by the a, z pair; the window and computation of median?
      – hjpotter92
      Jan 8 at 9:38










    • @Austin Hastings, based on your suggestion, I have made some revision to my code. Please let me know what you think. I am not getting the performance boost as I expected.
      – daniel lee
      Jan 11 at 19:23






    • 1




      First off, your median computation is still checking the parity of the window during each call. My suggestion was for you to do a check at the top of your main function (activityNotifications_bisect now) one time, and then dispatch to code that would compute the median correctly. I don't think you need a separate median function at all, if you want performance.
      – Austin Hastings
      Jan 11 at 19:28






    • 1




      Updated again. Note the effect of increasing d...
      – Austin Hastings
      Jan 11 at 21:54






    • 1




      Note that you might be able to squeeze some more performance out of insbisect using bisect.bisect_left and del instead of window.remove. I would expect it to win at d=500, not sure about 50, probably not at 4.
      – Austin Hastings
      Jan 11 at 21:59












    up vote
    4
    down vote



    accepted







    up vote
    4
    down vote



    accepted






    You are almost correct.



    I believe you are interpreting the performance as $O(n²*logn)$ because you are looping over (most of) the items in the input array (length = n) and you are using a sort to compute your median.



    However, sorted's performance is actually determined by the size of the window parameter, D, which is a constant, and its log is a constant, too! So your performance is technically $O(n*DlogD)$, which is $O(n × C)$ which is $O(n)$.



    But you can do better!



    Assume for a moment that you have the window sorted. You now receive a sequence of pairs (a, z) where value a is to be removed from the window and value z inserted. Once this is done, you will need to recompute the median.



    Also, you have two cases: one where the window size D is an odd number and thus the median is one of the values in the window; and the other case where the window size is an even number and the median must be computed as the mean of the central values.



    I think that if you separate your code into two functions at the top level - one for even windows, one for odd windows, you can write code that updates the sorted window without having to invoke the sorted function each time through. Thus, instead of D log D you can perform one removal and one insert in a single pass, having time proportional to D.



    You'll still pass over nearly all the items in the list, so your time will still be $O(n)$, but it won't be times quite so large a constant.



    Edit for @hjpotter92:



    You are traversing an unsorted sequence of numbers, using a sliding window:



     D = 4
    [a, b, c, d,] e, f, g, h
    a, [b, c, d, e,] f, g, h
    a, b, [c, d, e, f,] g, h


    Each time you advance, your window adds one number (I called it z) on the right side, and loses one number (I called it a) from the left side. So instead of pretending that each window is a whole new bunch of numbers that have to be put into order, using sort, you can write your code to remove one number, insert one number, and then recompute the median.



    In some cases (D is odd), the median is just window[D//2]. In other cases (D is even), the median is (window[D//2-1] + window[D//2]) / 2. This is because of how the median is computed. You can see OP doing this computation in the original code. However, there's no reason to keep checking D for odd/even, since it never changes in the function. Instead, it's better to split the function at the top level, and proceed from there:



    def func(sequence, d):
    if d % 2 == 0: # D is even
    # code for scanning with even window size
    else: # D is odd
    # code for scanning with odd window size


    Edit 2: Comprehensive performance test



    I made some changes to your code to test various performance enhancements. Here is one set of results run on my 2011 MacBook Pro (numbers are seconds, lower is better):



    $ python -V
    Python 3.5.4

    $ python test.py

    With n=10 inputs, d=4
    replsort 0.925
    replsort2 1.063
    linsert 1.549
    insbisect 1.102
    noparity 0.965
    orig 1.129

    With n=1000 inputs, d=4
    replsort 3.945
    replsort2 4.268
    linsert 5.210
    insbisect 4.280
    noparity 4.431
    orig 5.324

    With n=1000 inputs, d=50
    replsort 7.384
    replsort2 7.156
    linsert 8.763
    insbisect 4.736
    noparity 10.172
    orig 10.854

    With n=1000 inputs, d=500
    replsort 10.447
    replsort2 10.590
    linsert 8.184
    insbisect 6.641
    noparity 50.901
    orig 49.458


    The orig function is your code. The noparity function is your code with the parity check hoisted to the top. The replsort functions maintain the window by replacing the a value with the z value (as discussed above) and then sorting the window. The linsert and insbisect functions work by doing a linear scan of the window, or by calling bisect.insort_left, respectively, to get the z value in place.



    Important: your example is bad. You've got a short list of input numbers, with a small window size. That means the "fixed costs" will tend to dominate your performance.



    What does that mean? It means that things like calling the function, initializing count = 0, setting up and tearing down the for loop, and executing the return statement will make a difference.



    It also means that the difference between $O(n)$ and $O(log n)$ and $O(n log n)$ don't seem to matter.



    But when I switched to generating 1000 random numbers, instead of 10 fixed numbers, things changed! Notice the linsert function goes from being 50% slower, to being something like 25% slower when n=1000. Also, notice that all the other functions beat noparity. I believe this is because noparity always recomputes the entire slice before sorting it. That computation got added in 990 more times (for d=4) and it started to be a drag on the performance.



    Finally, I started increasing the window size. This is where the magic really happens, because this affects the performance of the sort, the bisect, the window search and removes. Essentially, this is the number that really determines the differences in performance. For a small window size, again the upfront costs of calling sorted or remove or bisect are dominating. But when the window size goes up, suddenly the differences in algorithmic efficiency start to matter.



    And look what happens with d=500. First, realize that setting d=500 means the code only does half the work! Because the input of 1000 minus a 500 window size leaves 500 checks to run, rather than 996 (d=4) or 950 (d=50). But when that change is made, suddenly linsert (which is $O(d)$) and insbisect (which is $O(log d)$) outperform everybody. Because "setup costs" and "written in C" no longer cover up $O(d log d)$!



    # test.py
    import bisect

    def orig(expenditure, d):
    """ Worst-case: O( n^2 * logn ) """
    num_of_trans = len(expenditure) - 1
    count = 0
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    if d % 2 == 0:
    left_i = (d - 1) // 2
    right_i = left_i + 1
    median = prev_trans[left_i] + prev_trans[right_i]
    else:
    median = prev_trans[d - 1]
    if expenditure[i] >= median:
    count += 1

    return count

    def noparity(expenditure, d):
    """Hoist parity check out of loop"""

    num_of_trans = len(expenditure) - 1
    count = 0
    left_i = (d - 1) // 2
    right_i = left_i + 1

    if d % 2 == 0:
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    median_x2 = prev_trans[left_i] + prev_trans[right_i]
    if expenditure[i] >= median_x2:
    count += 1
    else:
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    median_x2 = prev_trans[d // 2] * 2
    if expenditure[i] >= median_x2:
    count += 1

    return count

    def linsert(expenditure, d):
    """Hoisted median, using linear insertion instead of sort"""

    count = 0
    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    winremove = window.remove
    winsert = window.insert
    winappend = window.append

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1
    ai = 0
    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    winremove(a)
    for i in range(d-1): # d-1 because .remove
    if window[i] <= z:
    winsert(i, z)
    break
    else:
    winappend(z)

    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def insbisect(expenditure, d):
    """Hoisted median, using bisect for insertion."""

    count = 0
    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    winremove = window.remove

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1
    ai = 0
    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    winremove(a)
    bisect.insort_left(window, z)

    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def replsort(expenditure, d):
    """Hoisted median, using replacement via index, then sort."""

    count = 0

    window = sorted(expenditure[:d])
    windex = window.index
    winsort = window.sort

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1

    for a, z in zip(expenditure, expenditure[d:]):
    median_x2 = window[m0] + window[m1]
    if z >= median_x2:
    count += 1

    # Note: do not write `window =` anywhere, or you invalidate
    # windex and winsort.
    window[windex(a)] = z
    winsort()
    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def replsort2(expenditure, d):
    """Hoisted median, using replacement via index, then sort."""

    count = 0

    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    windex = window.index
    winsort = window.sort

    if d % 2 == 0:

    m1 = d // 2
    m0 = m1 - 1
    ai = 0

    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    # Note: do not write `window =` anywhere, or you invalidate
    # windex and winsort.
    window[windex(a)] = z
    winsort()
    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    if __name__ == "__main__":
    from timeit import timeit
    from random import randrange as rr

    results =[line.strip().split() for line in """
    replsort 9.730
    replsort2 11.538
    linsert 17.954
    insbisect 10.970
    noparity 10.574
    orig 11.801
    """.splitlines() if line.strip()]

    print("nWith n=10 inputs, d=4")
    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([2,3,4,2,3,6,8,4,5], 4)'.format(funcname),
    setup='from __main__ import '.format(funcname),
    number=100000)))

    print("nWith n=1000 inputs, d=4")

    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([rr(1, 16) for _ in range(1000)], 4)'.format(funcname),
    setup='from __main__ import rr, '.format(funcname),
    number=1000)))


    print("nWith n=1000 inputs, d=50")

    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([rr(1, 16) for _ in range(1000)], 50)'.format(funcname),
    setup='from __main__ import rr, '.format(funcname),
    number=1000)))





    share|improve this answer















    You are almost correct.



    I believe you are interpreting the performance as $O(n²*logn)$ because you are looping over (most of) the items in the input array (length = n) and you are using a sort to compute your median.



    However, sorted's performance is actually determined by the size of the window parameter, D, which is a constant, and its log is a constant, too! So your performance is technically $O(n*DlogD)$, which is $O(n × C)$ which is $O(n)$.



    But you can do better!



    Assume for a moment that you have the window sorted. You now receive a sequence of pairs (a, z) where value a is to be removed from the window and value z inserted. Once this is done, you will need to recompute the median.



    Also, you have two cases: one where the window size D is an odd number and thus the median is one of the values in the window; and the other case where the window size is an even number and the median must be computed as the mean of the central values.



    I think that if you separate your code into two functions at the top level - one for even windows, one for odd windows, you can write code that updates the sorted window without having to invoke the sorted function each time through. Thus, instead of D log D you can perform one removal and one insert in a single pass, having time proportional to D.



    You'll still pass over nearly all the items in the list, so your time will still be $O(n)$, but it won't be times quite so large a constant.



    Edit for @hjpotter92:



    You are traversing an unsorted sequence of numbers, using a sliding window:



     D = 4
    [a, b, c, d,] e, f, g, h
    a, [b, c, d, e,] f, g, h
    a, b, [c, d, e, f,] g, h


    Each time you advance, your window adds one number (I called it z) on the right side, and loses one number (I called it a) from the left side. So instead of pretending that each window is a whole new bunch of numbers that have to be put into order, using sort, you can write your code to remove one number, insert one number, and then recompute the median.



    In some cases (D is odd), the median is just window[D//2]. In other cases (D is even), the median is (window[D//2-1] + window[D//2]) / 2. This is because of how the median is computed. You can see OP doing this computation in the original code. However, there's no reason to keep checking D for odd/even, since it never changes in the function. Instead, it's better to split the function at the top level, and proceed from there:



    def func(sequence, d):
    if d % 2 == 0: # D is even
    # code for scanning with even window size
    else: # D is odd
    # code for scanning with odd window size


    Edit 2: Comprehensive performance test



    I made some changes to your code to test various performance enhancements. Here is one set of results run on my 2011 MacBook Pro (numbers are seconds, lower is better):



    $ python -V
    Python 3.5.4

    $ python test.py

    With n=10 inputs, d=4
    replsort 0.925
    replsort2 1.063
    linsert 1.549
    insbisect 1.102
    noparity 0.965
    orig 1.129

    With n=1000 inputs, d=4
    replsort 3.945
    replsort2 4.268
    linsert 5.210
    insbisect 4.280
    noparity 4.431
    orig 5.324

    With n=1000 inputs, d=50
    replsort 7.384
    replsort2 7.156
    linsert 8.763
    insbisect 4.736
    noparity 10.172
    orig 10.854

    With n=1000 inputs, d=500
    replsort 10.447
    replsort2 10.590
    linsert 8.184
    insbisect 6.641
    noparity 50.901
    orig 49.458


    The orig function is your code. The noparity function is your code with the parity check hoisted to the top. The replsort functions maintain the window by replacing the a value with the z value (as discussed above) and then sorting the window. The linsert and insbisect functions work by doing a linear scan of the window, or by calling bisect.insort_left, respectively, to get the z value in place.



    Important: your example is bad. You've got a short list of input numbers, with a small window size. That means the "fixed costs" will tend to dominate your performance.



    What does that mean? It means that things like calling the function, initializing count = 0, setting up and tearing down the for loop, and executing the return statement will make a difference.



    It also means that the difference between $O(n)$ and $O(log n)$ and $O(n log n)$ don't seem to matter.



    But when I switched to generating 1000 random numbers, instead of 10 fixed numbers, things changed! Notice the linsert function goes from being 50% slower, to being something like 25% slower when n=1000. Also, notice that all the other functions beat noparity. I believe this is because noparity always recomputes the entire slice before sorting it. That computation got added in 990 more times (for d=4) and it started to be a drag on the performance.



    Finally, I started increasing the window size. This is where the magic really happens, because this affects the performance of the sort, the bisect, the window search and removes. Essentially, this is the number that really determines the differences in performance. For a small window size, again the upfront costs of calling sorted or remove or bisect are dominating. But when the window size goes up, suddenly the differences in algorithmic efficiency start to matter.



    And look what happens with d=500. First, realize that setting d=500 means the code only does half the work! Because the input of 1000 minus a 500 window size leaves 500 checks to run, rather than 996 (d=4) or 950 (d=50). But when that change is made, suddenly linsert (which is $O(d)$) and insbisect (which is $O(log d)$) outperform everybody. Because "setup costs" and "written in C" no longer cover up $O(d log d)$!



    # test.py
    import bisect

    def orig(expenditure, d):
    """ Worst-case: O( n^2 * logn ) """
    num_of_trans = len(expenditure) - 1
    count = 0
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    if d % 2 == 0:
    left_i = (d - 1) // 2
    right_i = left_i + 1
    median = prev_trans[left_i] + prev_trans[right_i]
    else:
    median = prev_trans[d - 1]
    if expenditure[i] >= median:
    count += 1

    return count

    def noparity(expenditure, d):
    """Hoist parity check out of loop"""

    num_of_trans = len(expenditure) - 1
    count = 0
    left_i = (d - 1) // 2
    right_i = left_i + 1

    if d % 2 == 0:
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    median_x2 = prev_trans[left_i] + prev_trans[right_i]
    if expenditure[i] >= median_x2:
    count += 1
    else:
    for i in range(d, num_of_trans):
    prev_trans = sorted(expenditure[i - d:i])
    median_x2 = prev_trans[d // 2] * 2
    if expenditure[i] >= median_x2:
    count += 1

    return count

    def linsert(expenditure, d):
    """Hoisted median, using linear insertion instead of sort"""

    count = 0
    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    winremove = window.remove
    winsert = window.insert
    winappend = window.append

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1
    ai = 0
    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    winremove(a)
    for i in range(d-1): # d-1 because .remove
    if window[i] <= z:
    winsert(i, z)
    break
    else:
    winappend(z)

    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def insbisect(expenditure, d):
    """Hoisted median, using bisect for insertion."""

    count = 0
    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    winremove = window.remove

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1
    ai = 0
    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    winremove(a)
    bisect.insort_left(window, z)

    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def replsort(expenditure, d):
    """Hoisted median, using replacement via index, then sort."""

    count = 0

    window = sorted(expenditure[:d])
    windex = window.index
    winsort = window.sort

    if d % 2 == 0:
    m1 = d // 2
    m0 = m1 - 1

    for a, z in zip(expenditure, expenditure[d:]):
    median_x2 = window[m0] + window[m1]
    if z >= median_x2:
    count += 1

    # Note: do not write `window =` anywhere, or you invalidate
    # windex and winsort.
    window[windex(a)] = z
    winsort()
    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    def replsort2(expenditure, d):
    """Hoisted median, using replacement via index, then sort."""

    count = 0

    len_exp = len(expenditure)
    window = sorted(expenditure[:d])
    windex = window.index
    winsort = window.sort

    if d % 2 == 0:

    m1 = d // 2
    m0 = m1 - 1
    ai = 0

    for zi in range(d, len_exp):
    median_x2 = window[m0] + window[m1]
    a, z = expenditure[ai], expenditure[zi]
    ai += 1

    if z >= median_x2:
    count += 1

    # Note: do not write `window =` anywhere, or you invalidate
    # windex and winsort.
    window[windex(a)] = z
    winsort()
    else:
    raise NotImplementedError("I haven't written this code.")

    return count

    if __name__ == "__main__":
    from timeit import timeit
    from random import randrange as rr

    results =[line.strip().split() for line in """
    replsort 9.730
    replsort2 11.538
    linsert 17.954
    insbisect 10.970
    noparity 10.574
    orig 11.801
    """.splitlines() if line.strip()]

    print("nWith n=10 inputs, d=4")
    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([2,3,4,2,3,6,8,4,5], 4)'.format(funcname),
    setup='from __main__ import '.format(funcname),
    number=100000)))

    print("nWith n=1000 inputs, d=4")

    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([rr(1, 16) for _ in range(1000)], 4)'.format(funcname),
    setup='from __main__ import rr, '.format(funcname),
    number=1000)))


    print("nWith n=1000 inputs, d=50")

    for funcname, time_for_me in results:
    print(" :20s :5.3f".format(funcname,
    timeit('([rr(1, 16) for _ in range(1000)], 50)'.format(funcname),
    setup='from __main__ import rr, '.format(funcname),
    number=1000)))






    share|improve this answer















    share|improve this answer



    share|improve this answer








    edited Jan 11 at 21:53


























    answered Jan 8 at 5:15









    Austin Hastings

    6,1591130




    6,1591130











    • Can you explain in a bit more detail what you mean by the a, z pair; the window and computation of median?
      – hjpotter92
      Jan 8 at 9:38










    • @Austin Hastings, based on your suggestion, I have made some revision to my code. Please let me know what you think. I am not getting the performance boost as I expected.
      – daniel lee
      Jan 11 at 19:23






    • 1




      First off, your median computation is still checking the parity of the window during each call. My suggestion was for you to do a check at the top of your main function (activityNotifications_bisect now) one time, and then dispatch to code that would compute the median correctly. I don't think you need a separate median function at all, if you want performance.
      – Austin Hastings
      Jan 11 at 19:28






    • 1




      Updated again. Note the effect of increasing d...
      – Austin Hastings
      Jan 11 at 21:54






    • 1




      Note that you might be able to squeeze some more performance out of insbisect using bisect.bisect_left and del instead of window.remove. I would expect it to win at d=500, not sure about 50, probably not at 4.
      – Austin Hastings
      Jan 11 at 21:59
















    • Can you explain in a bit more detail what you mean by the a, z pair; the window and computation of median?
      – hjpotter92
      Jan 8 at 9:38










    • @Austin Hastings, based on your suggestion, I have made some revision to my code. Please let me know what you think. I am not getting the performance boost as I expected.
      – daniel lee
      Jan 11 at 19:23






    • 1




      First off, your median computation is still checking the parity of the window during each call. My suggestion was for you to do a check at the top of your main function (activityNotifications_bisect now) one time, and then dispatch to code that would compute the median correctly. I don't think you need a separate median function at all, if you want performance.
      – Austin Hastings
      Jan 11 at 19:28






    • 1




      Updated again. Note the effect of increasing d...
      – Austin Hastings
      Jan 11 at 21:54






    • 1




      Note that you might be able to squeeze some more performance out of insbisect using bisect.bisect_left and del instead of window.remove. I would expect it to win at d=500, not sure about 50, probably not at 4.
      – Austin Hastings
      Jan 11 at 21:59















    Can you explain in a bit more detail what you mean by the a, z pair; the window and computation of median?
    – hjpotter92
    Jan 8 at 9:38




    Can you explain in a bit more detail what you mean by the a, z pair; the window and computation of median?
    – hjpotter92
    Jan 8 at 9:38












    @Austin Hastings, based on your suggestion, I have made some revision to my code. Please let me know what you think. I am not getting the performance boost as I expected.
    – daniel lee
    Jan 11 at 19:23




    @Austin Hastings, based on your suggestion, I have made some revision to my code. Please let me know what you think. I am not getting the performance boost as I expected.
    – daniel lee
    Jan 11 at 19:23




    1




    1




    First off, your median computation is still checking the parity of the window during each call. My suggestion was for you to do a check at the top of your main function (activityNotifications_bisect now) one time, and then dispatch to code that would compute the median correctly. I don't think you need a separate median function at all, if you want performance.
    – Austin Hastings
    Jan 11 at 19:28




    First off, your median computation is still checking the parity of the window during each call. My suggestion was for you to do a check at the top of your main function (activityNotifications_bisect now) one time, and then dispatch to code that would compute the median correctly. I don't think you need a separate median function at all, if you want performance.
    – Austin Hastings
    Jan 11 at 19:28




    1




    1




    Updated again. Note the effect of increasing d...
    – Austin Hastings
    Jan 11 at 21:54




    Updated again. Note the effect of increasing d...
    – Austin Hastings
    Jan 11 at 21:54




    1




    1




    Note that you might be able to squeeze some more performance out of insbisect using bisect.bisect_left and del instead of window.remove. I would expect it to win at d=500, not sure about 50, probably not at 4.
    – Austin Hastings
    Jan 11 at 21:59




    Note that you might be able to squeeze some more performance out of insbisect using bisect.bisect_left and del instead of window.remove. I would expect it to win at d=500, not sure about 50, probably not at 4.
    – Austin Hastings
    Jan 11 at 21:59












    up vote
    0
    down vote














    2X greater than or equal to the median




    A more precise formulation would be "greater than or equal to twice the median"




    I believe it's O(n2∗log(n)).




    There is no n given in this problem. The n in O(n) isn't just some dummy variable; it refers to an actual parameter in the problem space. Saying "the complexity is O(n)", when you haven't said what n is, is meaningless.



    Every time you move the window, all but one element is already sorted. You can do a binary search to find where to insert the new element, giving O(log(D)) complexity. For D = 4, that won't save much time and probably isn't worth the overhead, but for large D it would be a significant savings.






    share|improve this answer

























      up vote
      0
      down vote














      2X greater than or equal to the median




      A more precise formulation would be "greater than or equal to twice the median"




      I believe it's O(n2∗log(n)).




      There is no n given in this problem. The n in O(n) isn't just some dummy variable; it refers to an actual parameter in the problem space. Saying "the complexity is O(n)", when you haven't said what n is, is meaningless.



      Every time you move the window, all but one element is already sorted. You can do a binary search to find where to insert the new element, giving O(log(D)) complexity. For D = 4, that won't save much time and probably isn't worth the overhead, but for large D it would be a significant savings.






      share|improve this answer























        up vote
        0
        down vote










        up vote
        0
        down vote










        2X greater than or equal to the median




        A more precise formulation would be "greater than or equal to twice the median"




        I believe it's O(n2∗log(n)).




        There is no n given in this problem. The n in O(n) isn't just some dummy variable; it refers to an actual parameter in the problem space. Saying "the complexity is O(n)", when you haven't said what n is, is meaningless.



        Every time you move the window, all but one element is already sorted. You can do a binary search to find where to insert the new element, giving O(log(D)) complexity. For D = 4, that won't save much time and probably isn't worth the overhead, but for large D it would be a significant savings.






        share|improve this answer














        2X greater than or equal to the median




        A more precise formulation would be "greater than or equal to twice the median"




        I believe it's O(n2∗log(n)).




        There is no n given in this problem. The n in O(n) isn't just some dummy variable; it refers to an actual parameter in the problem space. Saying "the complexity is O(n)", when you haven't said what n is, is meaningless.



        Every time you move the window, all but one element is already sorted. You can do a binary search to find where to insert the new element, giving O(log(D)) complexity. For D = 4, that won't save much time and probably isn't worth the overhead, but for large D it would be a significant savings.







        share|improve this answer













        share|improve this answer



        share|improve this answer











        answered Jan 8 at 17:20









        Acccumulation

        1,0195




        1,0195






















             

            draft saved


            draft discarded


























             


            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f184553%2fcount-how-many-times-a-value-is-greater-than-or-equal-to-the-median-of-previous%23new-answer', 'question_page');

            );

            Post as a guest













































































            Popular posts from this blog

            Chat program with C++ and SFML

            Function to Return a JSON Like Objects Using VBA Collections and Arrays

            Will my employers contract hold up in court?