Variant of 2-SUM Problem using Hashtable with multi-threading
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
4
down vote
favorite
I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:
The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.
The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.
Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t
Here is my entire Python code:
from collections import defaultdict
import threading
def hash_fn(n):
M=1999993
return n%M
def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)
ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()
print(ans)
Notes:
I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:
ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))
The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.
'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair
python multithreading time-limit-exceeded hash-table
 |Â
show 8 more comments
up vote
4
down vote
favorite
I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:
The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.
The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.
Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t
Here is my entire Python code:
from collections import defaultdict
import threading
def hash_fn(n):
M=1999993
return n%M
def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)
ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()
print(ans)
Notes:
I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:
ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))
The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.
'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair
python multithreading time-limit-exceeded hash-table
Please fix the indentation in your code. Thehash_fn
is misaligned for eg.
â hjpotter92
Apr 5 at 23:04
Sorry for bad indentation there. Fixed it.
â Piyush Singh
Apr 5 at 23:09
Welcome to Code Review! I hope you get some great answers.
â Phrancis
Apr 5 at 23:29
Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
â Turksarama
Apr 6 at 2:31
1
@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
â miracle173
Apr 11 at 8:25
 |Â
show 8 more comments
up vote
4
down vote
favorite
up vote
4
down vote
favorite
I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:
The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.
The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.
Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t
Here is my entire Python code:
from collections import defaultdict
import threading
def hash_fn(n):
M=1999993
return n%M
def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)
ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()
print(ans)
Notes:
I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:
ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))
The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.
'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair
python multithreading time-limit-exceeded hash-table
I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:
The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.
The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.
Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t
Here is my entire Python code:
from collections import defaultdict
import threading
def hash_fn(n):
M=1999993
return n%M
def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)
ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()
print(ans)
Notes:
I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:
ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))
The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.
'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair
python multithreading time-limit-exceeded hash-table
edited Apr 14 at 8:08
asked Apr 5 at 23:00
Piyush Singh
317
317
Please fix the indentation in your code. Thehash_fn
is misaligned for eg.
â hjpotter92
Apr 5 at 23:04
Sorry for bad indentation there. Fixed it.
â Piyush Singh
Apr 5 at 23:09
Welcome to Code Review! I hope you get some great answers.
â Phrancis
Apr 5 at 23:29
Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
â Turksarama
Apr 6 at 2:31
1
@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
â miracle173
Apr 11 at 8:25
 |Â
show 8 more comments
Please fix the indentation in your code. Thehash_fn
is misaligned for eg.
â hjpotter92
Apr 5 at 23:04
Sorry for bad indentation there. Fixed it.
â Piyush Singh
Apr 5 at 23:09
Welcome to Code Review! I hope you get some great answers.
â Phrancis
Apr 5 at 23:29
Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
â Turksarama
Apr 6 at 2:31
1
@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
â miracle173
Apr 11 at 8:25
Please fix the indentation in your code. The
hash_fn
is misaligned for eg.â hjpotter92
Apr 5 at 23:04
Please fix the indentation in your code. The
hash_fn
is misaligned for eg.â hjpotter92
Apr 5 at 23:04
Sorry for bad indentation there. Fixed it.
â Piyush Singh
Apr 5 at 23:09
Sorry for bad indentation there. Fixed it.
â Piyush Singh
Apr 5 at 23:09
Welcome to Code Review! I hope you get some great answers.
â Phrancis
Apr 5 at 23:29
Welcome to Code Review! I hope you get some great answers.
â Phrancis
Apr 5 at 23:29
Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
â Turksarama
Apr 6 at 2:31
Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
â Turksarama
Apr 6 at 2:31
1
1
@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
â miracle173
Apr 11 at 8:25
@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
â miracle173
Apr 11 at 8:25
 |Â
show 8 more comments
2 Answers
2
active
oldest
votes
up vote
2
down vote
functions
split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.
hashing
Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict
you use
global variable ans
Instead of passing the global variable ans
, the easiest way would be to use the fact that in a numeric context (like sum
), True
equals 1 and False
equals 0, so you can just return True or False in two_sum_present
my take on this:
read the file
Using the fact that a file is an iterator, I would do something like this
def read_file(filename):
with open(filename, 'r') as file:
return set(map(int, file))
find the target
About the same as your two_sum_present
with my remarks incorporated
def find_target(numbers, target):
for i in numbers:
if target - i in numbers and 2 * i != target:
return True
return False
iterate over the interval
def find_sums(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target(numbers, target)
putting it together
def main(filename, start=-10000, end=10000):
numbers = read_file(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums(numbers, start, end))
if __name__ == '__main__':
filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
print(main(filename, -100, 100))
All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time
This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101))
, but that one took 59s
Multithreading
If you want to, you can divide out all the call to find_target
to different workers/cores/... In this case, a frozenset
instead of a set
might be more appropriate since it's immutable, and will give less problems with concurrency
distinct
As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y
. To cover the second interpretation, You can change the set
to a collections.Counter
, and change the test criterium slightly
from collections import Counter
def read_file_distinct(filename):
with open(filename, 'r') as file:
return Counter(map(int, file))
def find_sums_distinct(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct(numbers, target)
def find_target_distinct(numbers, target):
for i in numbers:
if target - i in numbers and (2 * i != target or numbers[i] > 1):
return True
return False
def main_distinct(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums_distinct(numbers, start, end))
This has an effect on speed, though
%timeit main(filename, -10, 10)
4.51 s ñ 32.5 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
%timeit main_distinct(filename, -10, 10)
6.42 s ñ 48.4 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
distinct 2
A slightly different, faster approach can be used to tackle the 2nd interpretation
def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct2(numbers, repetitions, target)
def find_target_distinct2(numbers, repetitions, target):
for i in numbers:
if target - i in numbers and (2 * i != target or i in repetitions):
return True
return False
def main_distinct2(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
# print(repetitions)
numbers = set(numbers)
# print(f'len(numbers) numbers found, len(repetitions) repetitions')
return sum(find_sums_distinct2(numbers, repetitions, start, end))
%timeit main_distinct2(filename, -10, 10)
4.92 s ñ 160 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
This doesn't seem to handle the possibility of repetitions in the input.
â Gareth Rees
Apr 9 at 14:08
does it need to? The question mentionssuch that there are distinct numbers x,y in the input file that satisfy x+y=t
. If instead by distinct they also mean 2 repetitions of the same number, then aCounter
might be more appropriate, with a slightly revised test likeif target - i in numbers and (2 * i != target or numbers[i] > 1)
â Maarten Fabré
Apr 9 at 14:39
The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
â Gareth Rees
Apr 9 at 14:44
the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
â Maarten Fabré
Apr 9 at 14:45
@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
â Piyush Singh
Apr 9 at 17:19
add a comment |Â
up vote
1
down vote
import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
for num in file.read().splitlines():
num = int(num)
num_set.add(num)
num_list = sorted(num_set)
targets = set()
for num in num_list:
low = bisect.bisect_left(num_list, -10000-num)
high = bisect.bisect_right(num_list, 10000-num)
for num_2 in num_list[low:high]:
if num_2 != num:
targets.add(num+num_2)
print(len(targets))
Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
functions
split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.
hashing
Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict
you use
global variable ans
Instead of passing the global variable ans
, the easiest way would be to use the fact that in a numeric context (like sum
), True
equals 1 and False
equals 0, so you can just return True or False in two_sum_present
my take on this:
read the file
Using the fact that a file is an iterator, I would do something like this
def read_file(filename):
with open(filename, 'r') as file:
return set(map(int, file))
find the target
About the same as your two_sum_present
with my remarks incorporated
def find_target(numbers, target):
for i in numbers:
if target - i in numbers and 2 * i != target:
return True
return False
iterate over the interval
def find_sums(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target(numbers, target)
putting it together
def main(filename, start=-10000, end=10000):
numbers = read_file(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums(numbers, start, end))
if __name__ == '__main__':
filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
print(main(filename, -100, 100))
All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time
This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101))
, but that one took 59s
Multithreading
If you want to, you can divide out all the call to find_target
to different workers/cores/... In this case, a frozenset
instead of a set
might be more appropriate since it's immutable, and will give less problems with concurrency
distinct
As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y
. To cover the second interpretation, You can change the set
to a collections.Counter
, and change the test criterium slightly
from collections import Counter
def read_file_distinct(filename):
with open(filename, 'r') as file:
return Counter(map(int, file))
def find_sums_distinct(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct(numbers, target)
def find_target_distinct(numbers, target):
for i in numbers:
if target - i in numbers and (2 * i != target or numbers[i] > 1):
return True
return False
def main_distinct(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums_distinct(numbers, start, end))
This has an effect on speed, though
%timeit main(filename, -10, 10)
4.51 s ñ 32.5 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
%timeit main_distinct(filename, -10, 10)
6.42 s ñ 48.4 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
distinct 2
A slightly different, faster approach can be used to tackle the 2nd interpretation
def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct2(numbers, repetitions, target)
def find_target_distinct2(numbers, repetitions, target):
for i in numbers:
if target - i in numbers and (2 * i != target or i in repetitions):
return True
return False
def main_distinct2(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
# print(repetitions)
numbers = set(numbers)
# print(f'len(numbers) numbers found, len(repetitions) repetitions')
return sum(find_sums_distinct2(numbers, repetitions, start, end))
%timeit main_distinct2(filename, -10, 10)
4.92 s ñ 160 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
This doesn't seem to handle the possibility of repetitions in the input.
â Gareth Rees
Apr 9 at 14:08
does it need to? The question mentionssuch that there are distinct numbers x,y in the input file that satisfy x+y=t
. If instead by distinct they also mean 2 repetitions of the same number, then aCounter
might be more appropriate, with a slightly revised test likeif target - i in numbers and (2 * i != target or numbers[i] > 1)
â Maarten Fabré
Apr 9 at 14:39
The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
â Gareth Rees
Apr 9 at 14:44
the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
â Maarten Fabré
Apr 9 at 14:45
@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
â Piyush Singh
Apr 9 at 17:19
add a comment |Â
up vote
2
down vote
functions
split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.
hashing
Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict
you use
global variable ans
Instead of passing the global variable ans
, the easiest way would be to use the fact that in a numeric context (like sum
), True
equals 1 and False
equals 0, so you can just return True or False in two_sum_present
my take on this:
read the file
Using the fact that a file is an iterator, I would do something like this
def read_file(filename):
with open(filename, 'r') as file:
return set(map(int, file))
find the target
About the same as your two_sum_present
with my remarks incorporated
def find_target(numbers, target):
for i in numbers:
if target - i in numbers and 2 * i != target:
return True
return False
iterate over the interval
def find_sums(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target(numbers, target)
putting it together
def main(filename, start=-10000, end=10000):
numbers = read_file(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums(numbers, start, end))
if __name__ == '__main__':
filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
print(main(filename, -100, 100))
All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time
This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101))
, but that one took 59s
Multithreading
If you want to, you can divide out all the call to find_target
to different workers/cores/... In this case, a frozenset
instead of a set
might be more appropriate since it's immutable, and will give less problems with concurrency
distinct
As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y
. To cover the second interpretation, You can change the set
to a collections.Counter
, and change the test criterium slightly
from collections import Counter
def read_file_distinct(filename):
with open(filename, 'r') as file:
return Counter(map(int, file))
def find_sums_distinct(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct(numbers, target)
def find_target_distinct(numbers, target):
for i in numbers:
if target - i in numbers and (2 * i != target or numbers[i] > 1):
return True
return False
def main_distinct(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums_distinct(numbers, start, end))
This has an effect on speed, though
%timeit main(filename, -10, 10)
4.51 s ñ 32.5 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
%timeit main_distinct(filename, -10, 10)
6.42 s ñ 48.4 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
distinct 2
A slightly different, faster approach can be used to tackle the 2nd interpretation
def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct2(numbers, repetitions, target)
def find_target_distinct2(numbers, repetitions, target):
for i in numbers:
if target - i in numbers and (2 * i != target or i in repetitions):
return True
return False
def main_distinct2(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
# print(repetitions)
numbers = set(numbers)
# print(f'len(numbers) numbers found, len(repetitions) repetitions')
return sum(find_sums_distinct2(numbers, repetitions, start, end))
%timeit main_distinct2(filename, -10, 10)
4.92 s ñ 160 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
This doesn't seem to handle the possibility of repetitions in the input.
â Gareth Rees
Apr 9 at 14:08
does it need to? The question mentionssuch that there are distinct numbers x,y in the input file that satisfy x+y=t
. If instead by distinct they also mean 2 repetitions of the same number, then aCounter
might be more appropriate, with a slightly revised test likeif target - i in numbers and (2 * i != target or numbers[i] > 1)
â Maarten Fabré
Apr 9 at 14:39
The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
â Gareth Rees
Apr 9 at 14:44
the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
â Maarten Fabré
Apr 9 at 14:45
@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
â Piyush Singh
Apr 9 at 17:19
add a comment |Â
up vote
2
down vote
up vote
2
down vote
functions
split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.
hashing
Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict
you use
global variable ans
Instead of passing the global variable ans
, the easiest way would be to use the fact that in a numeric context (like sum
), True
equals 1 and False
equals 0, so you can just return True or False in two_sum_present
my take on this:
read the file
Using the fact that a file is an iterator, I would do something like this
def read_file(filename):
with open(filename, 'r') as file:
return set(map(int, file))
find the target
About the same as your two_sum_present
with my remarks incorporated
def find_target(numbers, target):
for i in numbers:
if target - i in numbers and 2 * i != target:
return True
return False
iterate over the interval
def find_sums(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target(numbers, target)
putting it together
def main(filename, start=-10000, end=10000):
numbers = read_file(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums(numbers, start, end))
if __name__ == '__main__':
filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
print(main(filename, -100, 100))
All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time
This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101))
, but that one took 59s
Multithreading
If you want to, you can divide out all the call to find_target
to different workers/cores/... In this case, a frozenset
instead of a set
might be more appropriate since it's immutable, and will give less problems with concurrency
distinct
As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y
. To cover the second interpretation, You can change the set
to a collections.Counter
, and change the test criterium slightly
from collections import Counter
def read_file_distinct(filename):
with open(filename, 'r') as file:
return Counter(map(int, file))
def find_sums_distinct(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct(numbers, target)
def find_target_distinct(numbers, target):
for i in numbers:
if target - i in numbers and (2 * i != target or numbers[i] > 1):
return True
return False
def main_distinct(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums_distinct(numbers, start, end))
This has an effect on speed, though
%timeit main(filename, -10, 10)
4.51 s ñ 32.5 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
%timeit main_distinct(filename, -10, 10)
6.42 s ñ 48.4 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
distinct 2
A slightly different, faster approach can be used to tackle the 2nd interpretation
def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct2(numbers, repetitions, target)
def find_target_distinct2(numbers, repetitions, target):
for i in numbers:
if target - i in numbers and (2 * i != target or i in repetitions):
return True
return False
def main_distinct2(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
# print(repetitions)
numbers = set(numbers)
# print(f'len(numbers) numbers found, len(repetitions) repetitions')
return sum(find_sums_distinct2(numbers, repetitions, start, end))
%timeit main_distinct2(filename, -10, 10)
4.92 s ñ 160 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
functions
split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.
hashing
Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict
you use
global variable ans
Instead of passing the global variable ans
, the easiest way would be to use the fact that in a numeric context (like sum
), True
equals 1 and False
equals 0, so you can just return True or False in two_sum_present
my take on this:
read the file
Using the fact that a file is an iterator, I would do something like this
def read_file(filename):
with open(filename, 'r') as file:
return set(map(int, file))
find the target
About the same as your two_sum_present
with my remarks incorporated
def find_target(numbers, target):
for i in numbers:
if target - i in numbers and 2 * i != target:
return True
return False
iterate over the interval
def find_sums(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target(numbers, target)
putting it together
def main(filename, start=-10000, end=10000):
numbers = read_file(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums(numbers, start, end))
if __name__ == '__main__':
filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
print(main(filename, -100, 100))
All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time
This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101))
, but that one took 59s
Multithreading
If you want to, you can divide out all the call to find_target
to different workers/cores/... In this case, a frozenset
instead of a set
might be more appropriate since it's immutable, and will give less problems with concurrency
distinct
As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y
. To cover the second interpretation, You can change the set
to a collections.Counter
, and change the test criterium slightly
from collections import Counter
def read_file_distinct(filename):
with open(filename, 'r') as file:
return Counter(map(int, file))
def find_sums_distinct(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct(numbers, target)
def find_target_distinct(numbers, target):
for i in numbers:
if target - i in numbers and (2 * i != target or numbers[i] > 1):
return True
return False
def main_distinct(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums_distinct(numbers, start, end))
This has an effect on speed, though
%timeit main(filename, -10, 10)
4.51 s ñ 32.5 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
%timeit main_distinct(filename, -10, 10)
6.42 s ñ 48.4 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
distinct 2
A slightly different, faster approach can be used to tackle the 2nd interpretation
def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct2(numbers, repetitions, target)
def find_target_distinct2(numbers, repetitions, target):
for i in numbers:
if target - i in numbers and (2 * i != target or i in repetitions):
return True
return False
def main_distinct2(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
# print(repetitions)
numbers = set(numbers)
# print(f'len(numbers) numbers found, len(repetitions) repetitions')
return sum(find_sums_distinct2(numbers, repetitions, start, end))
%timeit main_distinct2(filename, -10, 10)
4.92 s ñ 160 ms per loop (mean ñ std. dev. of 7 runs, 1 loop each)
edited Apr 9 at 15:04
answered Apr 9 at 12:25
Maarten Fabré
3,204214
3,204214
This doesn't seem to handle the possibility of repetitions in the input.
â Gareth Rees
Apr 9 at 14:08
does it need to? The question mentionssuch that there are distinct numbers x,y in the input file that satisfy x+y=t
. If instead by distinct they also mean 2 repetitions of the same number, then aCounter
might be more appropriate, with a slightly revised test likeif target - i in numbers and (2 * i != target or numbers[i] > 1)
â Maarten Fabré
Apr 9 at 14:39
The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
â Gareth Rees
Apr 9 at 14:44
the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
â Maarten Fabré
Apr 9 at 14:45
@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
â Piyush Singh
Apr 9 at 17:19
add a comment |Â
This doesn't seem to handle the possibility of repetitions in the input.
â Gareth Rees
Apr 9 at 14:08
does it need to? The question mentionssuch that there are distinct numbers x,y in the input file that satisfy x+y=t
. If instead by distinct they also mean 2 repetitions of the same number, then aCounter
might be more appropriate, with a slightly revised test likeif target - i in numbers and (2 * i != target or numbers[i] > 1)
â Maarten Fabré
Apr 9 at 14:39
The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
â Gareth Rees
Apr 9 at 14:44
the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
â Maarten Fabré
Apr 9 at 14:45
@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
â Piyush Singh
Apr 9 at 17:19
This doesn't seem to handle the possibility of repetitions in the input.
â Gareth Rees
Apr 9 at 14:08
This doesn't seem to handle the possibility of repetitions in the input.
â Gareth Rees
Apr 9 at 14:08
does it need to? The question mentions
such that there are distinct numbers x,y in the input file that satisfy x+y=t
. If instead by distinct they also mean 2 repetitions of the same number, then a Counter
might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
â Maarten Fabré
Apr 9 at 14:39
does it need to? The question mentions
such that there are distinct numbers x,y in the input file that satisfy x+y=t
. If instead by distinct they also mean 2 repetitions of the same number, then a Counter
might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
â Maarten Fabré
Apr 9 at 14:39
The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
â Gareth Rees
Apr 9 at 14:44
The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
â Gareth Rees
Apr 9 at 14:44
the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
â Maarten Fabré
Apr 9 at 14:45
the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
â Maarten Fabré
Apr 9 at 14:45
@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
â Piyush Singh
Apr 9 at 17:19
@MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
â Piyush Singh
Apr 9 at 17:19
add a comment |Â
up vote
1
down vote
import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
for num in file.read().splitlines():
num = int(num)
num_set.add(num)
num_list = sorted(num_set)
targets = set()
for num in num_list:
low = bisect.bisect_left(num_list, -10000-num)
high = bisect.bisect_right(num_list, 10000-num)
for num_2 in num_list[low:high]:
if num_2 != num:
targets.add(num+num_2)
print(len(targets))
Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.
add a comment |Â
up vote
1
down vote
import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
for num in file.read().splitlines():
num = int(num)
num_set.add(num)
num_list = sorted(num_set)
targets = set()
for num in num_list:
low = bisect.bisect_left(num_list, -10000-num)
high = bisect.bisect_right(num_list, 10000-num)
for num_2 in num_list[low:high]:
if num_2 != num:
targets.add(num+num_2)
print(len(targets))
Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.
add a comment |Â
up vote
1
down vote
up vote
1
down vote
import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
for num in file.read().splitlines():
num = int(num)
num_set.add(num)
num_list = sorted(num_set)
targets = set()
for num in num_list:
low = bisect.bisect_left(num_list, -10000-num)
high = bisect.bisect_right(num_list, 10000-num)
for num_2 in num_list[low:high]:
if num_2 != num:
targets.add(num+num_2)
print(len(targets))
Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.
import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
for num in file.read().splitlines():
num = int(num)
num_set.add(num)
num_list = sorted(num_set)
targets = set()
for num in num_list:
low = bisect.bisect_left(num_list, -10000-num)
high = bisect.bisect_right(num_list, 10000-num)
for num_2 in num_list[low:high]:
if num_2 != num:
targets.add(num+num_2)
print(len(targets))
Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.
edited Apr 10 at 12:26
Billal BEGUERADJ
1
1
answered Apr 9 at 17:21
Piyush Singh
317
317
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f191365%2fvariant-of-2-sum-problem-using-hashtable-with-multi-threading%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Please fix the indentation in your code. The
hash_fn
is misaligned for eg.â hjpotter92
Apr 5 at 23:04
Sorry for bad indentation there. Fixed it.
â Piyush Singh
Apr 5 at 23:09
Welcome to Code Review! I hope you get some great answers.
â Phrancis
Apr 5 at 23:29
Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
â Turksarama
Apr 6 at 2:31
1
@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
â miracle173
Apr 11 at 8:25