Variant of 2-SUM Problem using Hashtable with multi-threading

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
4
down vote

favorite
1












I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:




The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.



The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.



Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t




Here is my entire Python code:



from collections import defaultdict
import threading

def hash_fn(n):
M=1999993
return n%M

def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)

ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()

print(ans)


Notes:




  1. I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:



    ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))


  2. The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.


  3. 'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair







share|improve this question





















  • Please fix the indentation in your code. The hash_fn is misaligned for eg.
    – hjpotter92
    Apr 5 at 23:04










  • Sorry for bad indentation there. Fixed it.
    – Piyush Singh
    Apr 5 at 23:09










  • Welcome to Code Review! I hope you get some great answers.
    – Phrancis
    Apr 5 at 23:29










  • Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
    – Turksarama
    Apr 6 at 2:31






  • 1




    @PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
    – miracle173
    Apr 11 at 8:25

















up vote
4
down vote

favorite
1












I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:




The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.



The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.



Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t




Here is my entire Python code:



from collections import defaultdict
import threading

def hash_fn(n):
M=1999993
return n%M

def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)

ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()

print(ans)


Notes:




  1. I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:



    ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))


  2. The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.


  3. 'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair







share|improve this question





















  • Please fix the indentation in your code. The hash_fn is misaligned for eg.
    – hjpotter92
    Apr 5 at 23:04










  • Sorry for bad indentation there. Fixed it.
    – Piyush Singh
    Apr 5 at 23:09










  • Welcome to Code Review! I hope you get some great answers.
    – Phrancis
    Apr 5 at 23:29










  • Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
    – Turksarama
    Apr 6 at 2:31






  • 1




    @PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
    – miracle173
    Apr 11 at 8:25













up vote
4
down vote

favorite
1









up vote
4
down vote

favorite
1






1





I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:




The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.



The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.



Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t




Here is my entire Python code:



from collections import defaultdict
import threading

def hash_fn(n):
M=1999993
return n%M

def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)

ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()

print(ans)


Notes:




  1. I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:



    ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))


  2. The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.


  3. 'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair







share|improve this question













I am doing a Coursera assignment on Hashtable and my solution is taking too long to run. Here is the problem as described in the assignment:




The goal of this problem is to implement a variant of the 2-SUM algorithm covered in this week's lectures.



The file contains 1 million integers, both positive and negative (there might be some repetitions!).This is your array of integers, with the ith row of the file specifying the ith entry of the array.



Your task is to compute the number of target values t in the interval [-10000,10000] (inclusive) such that there are distinct numbers x,y in the input file that satisfy x+y=t




Here is my entire Python code:



from collections import defaultdict
import threading

def hash_fn(n):
M=1999993
return n%M

def two_sum_present(target):
global ans
bool=True
for x in nums:
y = target-x
if y in my_hash_table[hash_fn(y)] and y!=x:
bool=False
ans+=1
return 1
if bool == True:
return 0
f = open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt')
nums=
my_hash_table = defaultdict(list)
for line in f.readlines():
num = int(line.split('n')[0])
nums.append(num)
my_hash_table[hash_fn(num)].append(num)

ans=0
for nr in range(-10000, 10001):
thr = threading.Thread(target = two_sum_present, args=(nr,))
thr.start()

print(ans)


Notes:




  1. I found a page for the same problem here, but this solution uses neither hashing nor binary search, so it should take longer than my solution. This solution is essentially this:



    ans = sum(any(n-x in nums and 2*x != n for x in nums) for n in range(-10000, 10001))


  2. The text file can be found here. if someone is interested. But think of it as a big text file with 1M lines having 1 integer(+ve or -ve) on each line.


  3. 'distinct' as mentioned here only means that x+x=t is not a valid 2-sum pair; but two (x, y) pairs where x and y are not same constitute a valid pair









share|improve this question












share|improve this question




share|improve this question








edited Apr 14 at 8:08
























asked Apr 5 at 23:00









Piyush Singh

317




317











  • Please fix the indentation in your code. The hash_fn is misaligned for eg.
    – hjpotter92
    Apr 5 at 23:04










  • Sorry for bad indentation there. Fixed it.
    – Piyush Singh
    Apr 5 at 23:09










  • Welcome to Code Review! I hope you get some great answers.
    – Phrancis
    Apr 5 at 23:29










  • Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
    – Turksarama
    Apr 6 at 2:31






  • 1




    @PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
    – miracle173
    Apr 11 at 8:25

















  • Please fix the indentation in your code. The hash_fn is misaligned for eg.
    – hjpotter92
    Apr 5 at 23:04










  • Sorry for bad indentation there. Fixed it.
    – Piyush Singh
    Apr 5 at 23:09










  • Welcome to Code Review! I hope you get some great answers.
    – Phrancis
    Apr 5 at 23:29










  • Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
    – Turksarama
    Apr 6 at 2:31






  • 1




    @PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
    – miracle173
    Apr 11 at 8:25
















Please fix the indentation in your code. The hash_fn is misaligned for eg.
– hjpotter92
Apr 5 at 23:04




Please fix the indentation in your code. The hash_fn is misaligned for eg.
– hjpotter92
Apr 5 at 23:04












Sorry for bad indentation there. Fixed it.
– Piyush Singh
Apr 5 at 23:09




Sorry for bad indentation there. Fixed it.
– Piyush Singh
Apr 5 at 23:09












Welcome to Code Review! I hope you get some great answers.
– Phrancis
Apr 5 at 23:29




Welcome to Code Review! I hope you get some great answers.
– Phrancis
Apr 5 at 23:29












Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
– Turksarama
Apr 6 at 2:31




Your hash table will have collisions (you are not checking for equality before overwriting an existing entry) and your hash function is not really doing much for you. Try this trick, change your hash function to return the number you pass in to it, is it faster? I suspect it will be.
– Turksarama
Apr 6 at 2:31




1




1




@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
– miracle173
Apr 11 at 8:25





@PiyushSingh I removed the link that you changed in your last edit because strange things happened when I clicked the link. Maybe the site is infected.
– miracle173
Apr 11 at 8:25











2 Answers
2






active

oldest

votes

















up vote
2
down vote













functions



split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.



hashing



Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict you use



global variable ans



Instead of passing the global variable ans, the easiest way would be to use the fact that in a numeric context (like sum), True equals 1 and False equals 0, so you can just return True or False in two_sum_present



my take on this:



read the file



Using the fact that a file is an iterator, I would do something like this



def read_file(filename):
with open(filename, 'r') as file:
return set(map(int, file))


find the target



About the same as your two_sum_present with my remarks incorporated



def find_target(numbers, target):
for i in numbers:
if target - i in numbers and 2 * i != target:
return True
return False


iterate over the interval



def find_sums(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target(numbers, target)


putting it together



def main(filename, start=-10000, end=10000):
numbers = read_file(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums(numbers, start, end))

if __name__ == '__main__':
filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
print(main(filename, -100, 100))


All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time



This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101)), but that one took 59s



Multithreading



If you want to, you can divide out all the call to find_target to different workers/cores/... In this case, a frozenset instead of a set might be more appropriate since it's immutable, and will give less problems with concurrency



distinct



As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y. To cover the second interpretation, You can change the set to a collections.Counter, and change the test criterium slightly



from collections import Counter

def read_file_distinct(filename):
with open(filename, 'r') as file:
return Counter(map(int, file))

def find_sums_distinct(numbers, start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct(numbers, target)

def find_target_distinct(numbers, target):
for i in numbers:
if target - i in numbers and (2 * i != target or numbers[i] > 1):
return True
return False

def main_distinct(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
# print(f'numbers found: len(numbers)')
return sum(find_sums_distinct(numbers, start, end))


This has an effect on speed, though




%timeit main(filename, -10, 10)



4.51 s ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



%timeit main_distinct(filename, -10, 10)



6.42 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


distinct 2



A slightly different, faster approach can be used to tackle the 2nd interpretation



def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
for target in range(start, end + 1):
# print(target)
yield find_target_distinct2(numbers, repetitions, target)

def find_target_distinct2(numbers, repetitions, target):
for i in numbers:
if target - i in numbers and (2 * i != target or i in repetitions):
return True
return False

def main_distinct2(filename, start=-10000, end=10000):
numbers = read_file_distinct(filename)
repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
# print(repetitions)
numbers = set(numbers)
# print(f'len(numbers) numbers found, len(repetitions) repetitions')
return sum(find_sums_distinct2(numbers, repetitions, start, end))



%timeit main_distinct2(filename, -10, 10)



4.92 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)





share|improve this answer























  • This doesn't seem to handle the possibility of repetitions in the input.
    – Gareth Rees
    Apr 9 at 14:08










  • does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
    – Maarten Fabré
    Apr 9 at 14:39










  • The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
    – Gareth Rees
    Apr 9 at 14:44











  • the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
    – Maarten Fabré
    Apr 9 at 14:45










  • @MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
    – Piyush Singh
    Apr 9 at 17:19


















up vote
1
down vote













import bisect
num_set = set()
with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
for num in file.read().splitlines():
num = int(num)
num_set.add(num)

num_list = sorted(num_set)
targets = set()
for num in num_list:
low = bisect.bisect_left(num_list, -10000-num)
high = bisect.bisect_right(num_list, 10000-num)
for num_2 in num_list[low:high]:
if num_2 != num:
targets.add(num+num_2)

print(len(targets))


Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.






share|improve this answer























    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "196"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );








     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f191365%2fvariant-of-2-sum-problem-using-hashtable-with-multi-threading%23new-answer', 'question_page');

    );

    Post as a guest






























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    2
    down vote













    functions



    split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.



    hashing



    Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict you use



    global variable ans



    Instead of passing the global variable ans, the easiest way would be to use the fact that in a numeric context (like sum), True equals 1 and False equals 0, so you can just return True or False in two_sum_present



    my take on this:



    read the file



    Using the fact that a file is an iterator, I would do something like this



    def read_file(filename):
    with open(filename, 'r') as file:
    return set(map(int, file))


    find the target



    About the same as your two_sum_present with my remarks incorporated



    def find_target(numbers, target):
    for i in numbers:
    if target - i in numbers and 2 * i != target:
    return True
    return False


    iterate over the interval



    def find_sums(numbers, start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target(numbers, target)


    putting it together



    def main(filename, start=-10000, end=10000):
    numbers = read_file(filename)
    # print(f'numbers found: len(numbers)')
    return sum(find_sums(numbers, start, end))

    if __name__ == '__main__':
    filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
    print(main(filename, -100, 100))


    All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time



    This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101)), but that one took 59s



    Multithreading



    If you want to, you can divide out all the call to find_target to different workers/cores/... In this case, a frozenset instead of a set might be more appropriate since it's immutable, and will give less problems with concurrency



    distinct



    As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y. To cover the second interpretation, You can change the set to a collections.Counter, and change the test criterium slightly



    from collections import Counter

    def read_file_distinct(filename):
    with open(filename, 'r') as file:
    return Counter(map(int, file))

    def find_sums_distinct(numbers, start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target_distinct(numbers, target)

    def find_target_distinct(numbers, target):
    for i in numbers:
    if target - i in numbers and (2 * i != target or numbers[i] > 1):
    return True
    return False

    def main_distinct(filename, start=-10000, end=10000):
    numbers = read_file_distinct(filename)
    # print(f'numbers found: len(numbers)')
    return sum(find_sums_distinct(numbers, start, end))


    This has an effect on speed, though




    %timeit main(filename, -10, 10)



    4.51 s ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



    %timeit main_distinct(filename, -10, 10)



    6.42 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    distinct 2



    A slightly different, faster approach can be used to tackle the 2nd interpretation



    def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target_distinct2(numbers, repetitions, target)

    def find_target_distinct2(numbers, repetitions, target):
    for i in numbers:
    if target - i in numbers and (2 * i != target or i in repetitions):
    return True
    return False

    def main_distinct2(filename, start=-10000, end=10000):
    numbers = read_file_distinct(filename)
    repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
    # print(repetitions)
    numbers = set(numbers)
    # print(f'len(numbers) numbers found, len(repetitions) repetitions')
    return sum(find_sums_distinct2(numbers, repetitions, start, end))



    %timeit main_distinct2(filename, -10, 10)



    4.92 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)





    share|improve this answer























    • This doesn't seem to handle the possibility of repetitions in the input.
      – Gareth Rees
      Apr 9 at 14:08










    • does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
      – Maarten Fabré
      Apr 9 at 14:39










    • The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
      – Gareth Rees
      Apr 9 at 14:44











    • the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
      – Maarten Fabré
      Apr 9 at 14:45










    • @MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
      – Piyush Singh
      Apr 9 at 17:19















    up vote
    2
    down vote













    functions



    split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.



    hashing



    Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict you use



    global variable ans



    Instead of passing the global variable ans, the easiest way would be to use the fact that in a numeric context (like sum), True equals 1 and False equals 0, so you can just return True or False in two_sum_present



    my take on this:



    read the file



    Using the fact that a file is an iterator, I would do something like this



    def read_file(filename):
    with open(filename, 'r') as file:
    return set(map(int, file))


    find the target



    About the same as your two_sum_present with my remarks incorporated



    def find_target(numbers, target):
    for i in numbers:
    if target - i in numbers and 2 * i != target:
    return True
    return False


    iterate over the interval



    def find_sums(numbers, start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target(numbers, target)


    putting it together



    def main(filename, start=-10000, end=10000):
    numbers = read_file(filename)
    # print(f'numbers found: len(numbers)')
    return sum(find_sums(numbers, start, end))

    if __name__ == '__main__':
    filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
    print(main(filename, -100, 100))


    All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time



    This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101)), but that one took 59s



    Multithreading



    If you want to, you can divide out all the call to find_target to different workers/cores/... In this case, a frozenset instead of a set might be more appropriate since it's immutable, and will give less problems with concurrency



    distinct



    As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y. To cover the second interpretation, You can change the set to a collections.Counter, and change the test criterium slightly



    from collections import Counter

    def read_file_distinct(filename):
    with open(filename, 'r') as file:
    return Counter(map(int, file))

    def find_sums_distinct(numbers, start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target_distinct(numbers, target)

    def find_target_distinct(numbers, target):
    for i in numbers:
    if target - i in numbers and (2 * i != target or numbers[i] > 1):
    return True
    return False

    def main_distinct(filename, start=-10000, end=10000):
    numbers = read_file_distinct(filename)
    # print(f'numbers found: len(numbers)')
    return sum(find_sums_distinct(numbers, start, end))


    This has an effect on speed, though




    %timeit main(filename, -10, 10)



    4.51 s ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



    %timeit main_distinct(filename, -10, 10)



    6.42 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    distinct 2



    A slightly different, faster approach can be used to tackle the 2nd interpretation



    def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target_distinct2(numbers, repetitions, target)

    def find_target_distinct2(numbers, repetitions, target):
    for i in numbers:
    if target - i in numbers and (2 * i != target or i in repetitions):
    return True
    return False

    def main_distinct2(filename, start=-10000, end=10000):
    numbers = read_file_distinct(filename)
    repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
    # print(repetitions)
    numbers = set(numbers)
    # print(f'len(numbers) numbers found, len(repetitions) repetitions')
    return sum(find_sums_distinct2(numbers, repetitions, start, end))



    %timeit main_distinct2(filename, -10, 10)



    4.92 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)





    share|improve this answer























    • This doesn't seem to handle the possibility of repetitions in the input.
      – Gareth Rees
      Apr 9 at 14:08










    • does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
      – Maarten Fabré
      Apr 9 at 14:39










    • The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
      – Gareth Rees
      Apr 9 at 14:44











    • the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
      – Maarten Fabré
      Apr 9 at 14:45










    • @MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
      – Piyush Singh
      Apr 9 at 17:19













    up vote
    2
    down vote










    up vote
    2
    down vote









    functions



    split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.



    hashing



    Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict you use



    global variable ans



    Instead of passing the global variable ans, the easiest way would be to use the fact that in a numeric context (like sum), True equals 1 and False equals 0, so you can just return True or False in two_sum_present



    my take on this:



    read the file



    Using the fact that a file is an iterator, I would do something like this



    def read_file(filename):
    with open(filename, 'r') as file:
    return set(map(int, file))


    find the target



    About the same as your two_sum_present with my remarks incorporated



    def find_target(numbers, target):
    for i in numbers:
    if target - i in numbers and 2 * i != target:
    return True
    return False


    iterate over the interval



    def find_sums(numbers, start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target(numbers, target)


    putting it together



    def main(filename, start=-10000, end=10000):
    numbers = read_file(filename)
    # print(f'numbers found: len(numbers)')
    return sum(find_sums(numbers, start, end))

    if __name__ == '__main__':
    filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
    print(main(filename, -100, 100))


    All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time



    This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101)), but that one took 59s



    Multithreading



    If you want to, you can divide out all the call to find_target to different workers/cores/... In this case, a frozenset instead of a set might be more appropriate since it's immutable, and will give less problems with concurrency



    distinct



    As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y. To cover the second interpretation, You can change the set to a collections.Counter, and change the test criterium slightly



    from collections import Counter

    def read_file_distinct(filename):
    with open(filename, 'r') as file:
    return Counter(map(int, file))

    def find_sums_distinct(numbers, start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target_distinct(numbers, target)

    def find_target_distinct(numbers, target):
    for i in numbers:
    if target - i in numbers and (2 * i != target or numbers[i] > 1):
    return True
    return False

    def main_distinct(filename, start=-10000, end=10000):
    numbers = read_file_distinct(filename)
    # print(f'numbers found: len(numbers)')
    return sum(find_sums_distinct(numbers, start, end))


    This has an effect on speed, though




    %timeit main(filename, -10, 10)



    4.51 s ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



    %timeit main_distinct(filename, -10, 10)



    6.42 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    distinct 2



    A slightly different, faster approach can be used to tackle the 2nd interpretation



    def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target_distinct2(numbers, repetitions, target)

    def find_target_distinct2(numbers, repetitions, target):
    for i in numbers:
    if target - i in numbers and (2 * i != target or i in repetitions):
    return True
    return False

    def main_distinct2(filename, start=-10000, end=10000):
    numbers = read_file_distinct(filename)
    repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
    # print(repetitions)
    numbers = set(numbers)
    # print(f'len(numbers) numbers found, len(repetitions) repetitions')
    return sum(find_sums_distinct2(numbers, repetitions, start, end))



    %timeit main_distinct2(filename, -10, 10)



    4.92 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)





    share|improve this answer















    functions



    split the work you want to do in logical functions. I try to limit the work that is done in the general script (and thus the global namespace) as much as possible. Lookups of local variable are supposed to be faster, but it make the code easier to read and test as well, so win-win.



    hashing



    Why device your own hashing? just dumping all the numbers in a set will eliminate all the duplicates, and 1M integers is a limited amount of memory. Certainly a lot less that the defaultdict you use



    global variable ans



    Instead of passing the global variable ans, the easiest way would be to use the fact that in a numeric context (like sum), True equals 1 and False equals 0, so you can just return True or False in two_sum_present



    my take on this:



    read the file



    Using the fact that a file is an iterator, I would do something like this



    def read_file(filename):
    with open(filename, 'r') as file:
    return set(map(int, file))


    find the target



    About the same as your two_sum_present with my remarks incorporated



    def find_target(numbers, target):
    for i in numbers:
    if target - i in numbers and 2 * i != target:
    return True
    return False


    iterate over the interval



    def find_sums(numbers, start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target(numbers, target)


    putting it together



    def main(filename, start=-10000, end=10000):
    numbers = read_file(filename)
    # print(f'numbers found: len(numbers)')
    return sum(find_sums(numbers, start, end))

    if __name__ == '__main__':
    filename = 'data/_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt'
    print(main(filename, -100, 100))


    All in all, it took 43 seconds to scan those 201 numbers on my system, to find 5 matches. Reading the file took about .6s of that time



    This solution is essentially the same as sum(any(n-x in numbers and 2*x != n for x in numbers) for n in range(-100, 101)), but that one took 59s



    Multithreading



    If you want to, you can divide out all the call to find_target to different workers/cores/... In this case, a frozenset instead of a set might be more appropriate since it's immutable, and will give less problems with concurrency



    distinct



    As noted by Gareth Rees, the original question is ambiguous on what is mean with distinct numbers x,y. To cover the second interpretation, You can change the set to a collections.Counter, and change the test criterium slightly



    from collections import Counter

    def read_file_distinct(filename):
    with open(filename, 'r') as file:
    return Counter(map(int, file))

    def find_sums_distinct(numbers, start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target_distinct(numbers, target)

    def find_target_distinct(numbers, target):
    for i in numbers:
    if target - i in numbers and (2 * i != target or numbers[i] > 1):
    return True
    return False

    def main_distinct(filename, start=-10000, end=10000):
    numbers = read_file_distinct(filename)
    # print(f'numbers found: len(numbers)')
    return sum(find_sums_distinct(numbers, start, end))


    This has an effect on speed, though




    %timeit main(filename, -10, 10)



    4.51 s ± 32.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



    %timeit main_distinct(filename, -10, 10)



    6.42 s ± 48.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


    distinct 2



    A slightly different, faster approach can be used to tackle the 2nd interpretation



    def find_sums_distinct2(numbers, repetitions,start=-10000, end=10000):
    for target in range(start, end + 1):
    # print(target)
    yield find_target_distinct2(numbers, repetitions, target)

    def find_target_distinct2(numbers, repetitions, target):
    for i in numbers:
    if target - i in numbers and (2 * i != target or i in repetitions):
    return True
    return False

    def main_distinct2(filename, start=-10000, end=10000):
    numbers = read_file_distinct(filename)
    repetitions = k for k, v in numbers.items() if ((v > 1) and (start < k < end))
    # print(repetitions)
    numbers = set(numbers)
    # print(f'len(numbers) numbers found, len(repetitions) repetitions')
    return sum(find_sums_distinct2(numbers, repetitions, start, end))



    %timeit main_distinct2(filename, -10, 10)



    4.92 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)






    share|improve this answer















    share|improve this answer



    share|improve this answer








    edited Apr 9 at 15:04


























    answered Apr 9 at 12:25









    Maarten Fabré

    3,204214




    3,204214











    • This doesn't seem to handle the possibility of repetitions in the input.
      – Gareth Rees
      Apr 9 at 14:08










    • does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
      – Maarten Fabré
      Apr 9 at 14:39










    • The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
      – Gareth Rees
      Apr 9 at 14:44











    • the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
      – Maarten Fabré
      Apr 9 at 14:45










    • @MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
      – Piyush Singh
      Apr 9 at 17:19

















    • This doesn't seem to handle the possibility of repetitions in the input.
      – Gareth Rees
      Apr 9 at 14:08










    • does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
      – Maarten Fabré
      Apr 9 at 14:39










    • The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
      – Gareth Rees
      Apr 9 at 14:44











    • the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
      – Maarten Fabré
      Apr 9 at 14:45










    • @MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
      – Piyush Singh
      Apr 9 at 17:19
















    This doesn't seem to handle the possibility of repetitions in the input.
    – Gareth Rees
    Apr 9 at 14:08




    This doesn't seem to handle the possibility of repetitions in the input.
    – Gareth Rees
    Apr 9 at 14:08












    does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
    – Maarten Fabré
    Apr 9 at 14:39




    does it need to? The question mentions such that there are distinct numbers x,y in the input file that satisfy x+y=t. If instead by distinct they also mean 2 repetitions of the same number, then a Counter might be more appropriate, with a slightly revised test like if target - i in numbers and (2 * i != target or numbers[i] > 1)
    – Maarten Fabré
    Apr 9 at 14:39












    The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
    – Gareth Rees
    Apr 9 at 14:44





    The question is not exactly clear on this point. From a pragmatic point of view, why note "there might be some repetitions" unless this is relevant somehow? But on the other hand, you're right that "distinct" suggests $x ne y$.
    – Gareth Rees
    Apr 9 at 14:44













    the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
    – Maarten Fabré
    Apr 9 at 14:45




    the question is unclear on this point indeed, and I selected one interpretation. I will try to adapt the answer to include both interpretations
    – Maarten Fabré
    Apr 9 at 14:45












    @MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
    – Piyush Singh
    Apr 9 at 17:19





    @MaartenFabré I have added the meaning of 'distinct' in the question. Also in find_target function you should search only in the range [-10000-target, 10000-target], because, that's where any relevant complementary number for that number might lie; if any. Also you don't need to search for 10001 because we are not interested if any number pair add up to 10001. For completion, I am adding my own answer which ran in 2 seconds on my PC.
    – Piyush Singh
    Apr 9 at 17:19













    up vote
    1
    down vote













    import bisect
    num_set = set()
    with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
    for num in file.read().splitlines():
    num = int(num)
    num_set.add(num)

    num_list = sorted(num_set)
    targets = set()
    for num in num_list:
    low = bisect.bisect_left(num_list, -10000-num)
    high = bisect.bisect_right(num_list, 10000-num)
    for num_2 in num_list[low:high]:
    if num_2 != num:
    targets.add(num+num_2)

    print(len(targets))


    Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.






    share|improve this answer



























      up vote
      1
      down vote













      import bisect
      num_set = set()
      with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
      for num in file.read().splitlines():
      num = int(num)
      num_set.add(num)

      num_list = sorted(num_set)
      targets = set()
      for num in num_list:
      low = bisect.bisect_left(num_list, -10000-num)
      high = bisect.bisect_right(num_list, 10000-num)
      for num_2 in num_list[low:high]:
      if num_2 != num:
      targets.add(num+num_2)

      print(len(targets))


      Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.






      share|improve this answer

























        up vote
        1
        down vote










        up vote
        1
        down vote









        import bisect
        num_set = set()
        with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
        for num in file.read().splitlines():
        num = int(num)
        num_set.add(num)

        num_list = sorted(num_set)
        targets = set()
        for num in num_list:
        low = bisect.bisect_left(num_list, -10000-num)
        high = bisect.bisect_right(num_list, 10000-num)
        for num_2 in num_list[low:high]:
        if num_2 != num:
        targets.add(num+num_2)

        print(len(targets))


        Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.






        share|improve this answer















        import bisect
        num_set = set()
        with open('_6ec67df2804ff4b58ab21c12edcb21f8_algo1-programming_prob-2sum.txt') as file:
        for num in file.read().splitlines():
        num = int(num)
        num_set.add(num)

        num_list = sorted(num_set)
        targets = set()
        for num in num_list:
        low = bisect.bisect_left(num_list, -10000-num)
        high = bisect.bisect_right(num_list, 10000-num)
        for num_2 in num_list[low:high]:
        if num_2 != num:
        targets.add(num+num_2)

        print(len(targets))


        Using Maarten's code, I wrote this; but I limited my search range for each number to [-10000-number, 10000-number] because that's where any relevant number for that number will lie; if any. This helped greatly with the running time.







        share|improve this answer















        share|improve this answer



        share|improve this answer








        edited Apr 10 at 12:26









        Billal BEGUERADJ

        1




        1











        answered Apr 9 at 17:21









        Piyush Singh

        317




        317






















             

            draft saved


            draft discarded


























             


            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f191365%2fvariant-of-2-sum-problem-using-hashtable-with-multi-threading%23new-answer', 'question_page');

            );

            Post as a guest













































































            Popular posts from this blog

            Chat program with C++ and SFML

            Function to Return a JSON Like Objects Using VBA Collections and Arrays

            Will my employers contract hold up in court?