Removing neighbors in a point cloud

I have written a program that thins out a point cloud based on the distances between its points. The code works well for a small number of points: for 1700 points it takes about 6 minutes. But my point cloud has 300000 points, and the code is still running after almost 30 hours.



I am sure there is a more Pythonic way to optimize the code, but I do not know how to reduce the calculation time and improve the performance. I have read a lot about multithreading, chunk sizes, etc., but I think those are not necessary for data of this size. I think I am using memory in a very bad way. Am I right? Here is the code:



import pickle
import math
import itertools
import gc

def dist(p0, p1):
    return math.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2)

def join_pair(points, r):
    for p, q in itertools.combinations(points, 2):
        if dist(p, q) < r:
            points.remove(q)
            return True
    return False


if __name__ == "__main__":
    filename = 'points.txt'
    filename2 = 'points_modified.txt'
    mynumbers = []
    with open(filename) as f:
        for line in f:
            mynumbers.append([float(n) for n in line.strip().split(' ')])
    mynumbers = sorted(mynumbers, key=lambda tup: (tup[1], tup[0]))
    gc.collect()
    while join_pair(mynumbers, 3.6):
        pass

    with open(filename2, "w") as fp2:
        for j in range(len(mynumbers)):
            fp2.write(' '.join(map(repr, mynumbers[j])) + '\n')


and the input data is like the following:



661234.58 5763766.03 72.63
661254.81 5763765.08 75.04
661244.86 5763764.14 74.99
661234.90 5763763.21 74.94
661225.13 5763762.29 74.89



























  • can you give a few lines of points.txt as sample data?
    – Maarten Fabré
    Jun 8 at 9:36










  • can you use external libraries as numpy or pandas?
    – Maarten Fabré
    Jun 8 at 9:36










  • a bit more explanation on what join_pair needs to do would be welcome too
    – Maarten Fabré
    Jun 8 at 9:37










  • A sample of my points has been added. Yes, I can use numpy. join_pair takes the points, calculates the distance (via the dist function) and compares it with the 3.6 meter limit. If the distance is less than the limit, the point is deleted from the final output. At the end I get a point cloud whose points are more than 3.6 meters apart.
    – Mohamad Reza Salehi Sadaghiani
    Jun 8 at 9:53










  • there is no need for this size of dat - is the last word supposed to be "data" or "that"?
    – mkrieger1
    Jun 9 at 7:58
















2 Answers



























1. Introduction



It is not a good plan to leave code running for 30 hours. Until you have analyzed the code and figured out how its runtime varies with the size of the input, you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.



In this situation you need to (i) analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii) measure the performance of the code on small inputs; (iii) use the results of (i) and (ii) to extrapolate to the runtime for the full data set.



2. Review



  1. There are no docstrings. What are these functions supposed to do?



  2. dist can be simplified using math.hypot:



    def dist(p, q):
        "Return the Euclidean distance between points p and q."
        return math.hypot(p[0] - q[0], p[1] - q[1])


  3. The name join_pairs is misleading because it does not join anything. I am sure you could pick a better name.



  4. It would be useful (for testing and performance measurement) to have a function that calls join_pairs in a loop until it returns False. I am going to use this function:



    def sparse_subset1(points, r):
        """Return a maximal list of elements of points such that no pairs of
        points in the result have distance less than r.

        """
        points = points[:]
        while join_pairs(points, r):
            pass
        return points


    The reason that I am taking a copy of points (rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.



3. Performance




  1. Here's a test case of 1,000 random points:



    import random
    POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]


    The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:



    >>> from timeit import timeit
    >>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
    10.972541099996306


    Why does it take so long? The problem is that join_pairs removes just one point each time it is called. This means that you have to call it in a loop, but each time it is called it starts over, comparing the same pairs of points that were already compared in the previous iteration. As a result, the runtime of this algorithm is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.



    We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best-fit:



    >>> import numpy as np
    >>> n = np.arange(100, 1001, 100)
    >>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
    >>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]


    We can plot this on a log-log graph using matplotlib:



    >>> import matplotlib.pyplot as plt
    >>> plt.xlabel("n")
    >>> plt.ylabel("t (seconds)")
    >>> plt.loglog(n, t, 'r+', label="data")
    >>> plt.loglog(n, a * n**3, label="least squares fit to $t = an^3$")
    >>> plt.legend()
    >>> plt.grid(alpha=0.25)
    >>> plt.show()




    So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:



    >>> a * 300000**3
    339550658.803593


    That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope of it finishing in a few more hours, and so we have to find a better approach.




  2. It would be better to do the work in a single pass over the list of points, like this:



    def sparse_subset2(points, r):
        """Return a maximal list of elements of points such that no pairs of
        points in the result have distance less than r.

        """
        result = []
        for p in points:
            if all(dist(p, q) >= r for q in result):
                result.append(p)
        return result


    This builds up the modified point cloud in the list result and returns it (instead of modifying the input points). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.



    To check that this is correct, we can compare the output against the original version of the code:



    >>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
    True


    This revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-point test case, this is hundreds of times faster than the code in the post:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
    0.04830290700192563



  3. To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.



    import rtree

    def sparse_subset3(points, r):
        """Return a maximal list of elements of points such that no pairs of
        points in the result have distance less than r.

        """
        result = []
        index = rtree.index.Index()
        for i, p in enumerate(points):
            px, py = p
            nearby = index.intersection((px - r, py - r, px + r, py + r))
            if all(dist(p, points[j]) >= r for j in nearby):
                result.append(p)
                index.insert(i, (px, py, px, py))
        return result


    When this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point p for inclusion in the result set, the code queries a rectangle centred on p with width and height 2 * r. All points within a distance r of p are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant than r. That's why the code has to further check each point in the query rectangle.



    Again, we need to check that this is correct:



    >>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
    True


    An R-tree takes $O(\log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of sparse_subset3 to be $O(n \log m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.



    Let's make a much bigger test case, with 100,000 points:



    POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]


    The performance of sparse_subset2 and sparse_subset3 depends on $m$, the number of points in the result. We expect that when r is big, $m$ will be small and so there won't be much difference. Here we use r=0.05 and get a subset of 287 points, and the two algorithms have almost identical runtime:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
    3.5980678769992664
    >>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
    3.598255231976509


    But if we reduce r to 0.01, getting a subset of 5994 points, then the R-tree shines:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
    87.03980600900832
    >>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
    4.6150341169850435


    (I didn't try sparse_subset1 on this test case as it will take many hours.)



    The worst case for sparse_subset3 is when r is zero and so all the points in the input end up in the result:



    >>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
    10.394442071992671


    Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.
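    As a rough scaling check on that extrapolation: scaling the 10.4-second worst case by $n \log n$ gives about $10.4 \times (300000 \log 300000) / (100000 \log 100000) \approx 34$ seconds, which is consistent with "under a minute".

    One practical wrinkle when applying this to the data in the question: each row has three columns (x, y and z), while sparse_subset3 unpacks exactly two coordinates. Here is a minimal sketch of one way to adapt it (sparse_rows is a new name introduced here, not part of the code above), keeping the full rows but indexing only on (x, y), which is all that dist looks at:

    import rtree

    def sparse_rows(rows, r):
        """Return rows of `rows` (each [x, y, z]) whose (x, y) positions are
        pairwise at least r apart."""
        result = []
        index = rtree.index.Index()
        for i, row in enumerate(rows):
            x, y = row[0], row[1]
            # query a box of side 2*r around (x, y), then check true distances
            nearby = index.intersection((x - r, y - r, x + r, y + r))
            if all(dist(row, rows[j]) >= r for j in nearby):
                result.append(row)
                index.insert(i, (x, y, x, y))
        return result

    # e.g. mynumbers = sparse_rows(mynumbers, 3.6)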


























  • It’s not often you get to speed up someone’s program by a million times.
    – Gareth Rees
    Jun 8 at 19:12










  • My one nitpick here would be don't use hypot. It is a function built for numerical stability by using fancy trig. Instead, you only need to minimize distance squared, which op might have done faster.
    – Oscar Smith
    Jun 9 at 0:49










  • @OscarSmith: I tried it just now and I find that calling hypot is about twice as fast as computing the squared distance.
    – Gareth Rees
    Jun 9 at 8:25










  • WOW! it is for me 1000 times better than a lecture. Thanks a lot for your knowledge,wisdom and time.
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 10:19










  • ok, I'm surprised
    – Oscar Smith
    Jun 9 at 14:26






























Gareth provided some excellent ideas, to which I will add a very similar solution, only using scipy instead of rtree. I would also like to point out a few other minutiae I think are important or worth knowing.



1.
import gc



You should also never alter the garbage collector's behaviour unless you absolutely know you need to. Don't use it as a "fix" unless you absolutely know that inaccessible circular references (basically the only genuine use case) are your problem. Even then, prefer to use weak references to break reference loops and let reference counting take care of cleanup rather than garbage collection. Also, 300k 3D vectors is not much data for a modern computer, so we can afford to keep a few copies in memory temporarily.
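As a minimal, generic illustration of the weak-reference point (nothing to do with the point-cloud code itself): a parent/child structure that holds the parent only through a weak reference never forms a strong reference cycle, so plain reference counting can reclaim it without help from the cyclic garbage collector.

import weakref

class Node:
    def __init__(self, parent=None):
        # hold the parent only via a weak reference, so parent <-> child
        # never forms a strong reference cycle
        self._parent = weakref.ref(parent) if parent is not None else None
        self.children = []

    @property
    def parent(self):
        # dereference the weak ref; returns None once the parent is gone
        return self._parent() if self._parent is not None else None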



2.
itertools.combinations(points, 2)



This line, combined with removing only one point per function call, is what gives you the awful time complexity. I would suggest using scipy.spatial.KDTree or cKDTree to accomplish a similar task to rtree. My reasoning for this is that some very useful clustering and neighbor-matching algorithms are already implemented for you, and in the case of cKDTree, they are implemented as a pre-compiled C library for much better speed than pure Python.



In particular I'd like to use KDTree.query_pairs to generate a set of all point pairs that are closer than a given radius. Once you have that set you can simply iterate over the pairs, and maintain a set of nodes to keep and a set of nodes to remove. This will not however preserve the exact same nodes as your reference implementation, as order of iteration will not be guaranteed (points kept will depend on what comes first), and there will be a bias towards a particular direction as the pairs are reported as (i,j) where i < j. That's not to say that your initial method introduces no bias as well, as you sort the input list, so this may not be a problem.
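For example, here is a minimal, untested sketch of that pair-iteration idea (sparse_by_pairs is just an illustrative name, and it carries the ordering caveats described above):

import scipy.spatial

def sparse_by_pairs(points, r):
    # note: cKDTree measures distance over all coordinates of the tuples it is given
    tree = scipy.spatial.cKDTree(points)
    discard = set()                   # indices of points to drop
    for i, j in tree.query_pairs(r):  # pairs within distance r, reported with i < j
        if i not in discard and j not in discard:
            discard.add(j)            # drop j; i survives unless some other pair discards it
    return [p for k, p in enumerate(points) if k not in discard]

The kept points are guaranteed to be pairwise at least r apart, but because the pairs come back in no particular order, this version may drop more points than strictly necessary.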



Alternatively you could convert the set of pairs into a dictionary of neighbors where each point is a key, with the value being a set of neighboring points. Then you could take your original list of points in the order you desire and use the dictionary to find and eliminate neighboring points. This should give you the same results as your initial implementation, but with much better speed. I have not tested this, so you may find you need to cast between tuple and numpy.array in a few places for it to work perfectly...



import scipy.spatial

#convert inner lists to tuples because `set()` doesn't play nice with lists
#you would probably be better off doing this at creation while reading the file.
mynumbers = [tuple(point) for point in mynumbers]
tree = scipy.spatial.cKDTree(mynumbers)  #build k-dimensional tree
pairs = tree.query_pairs(r)  #find all pairs closer than radius: r
neighbors = {}  #dictionary of neighbors
for i, j in pairs:  #iterate over all pairs
    neighbors.setdefault(mynumbers[i], set()).add(mynumbers[j])
    neighbors.setdefault(mynumbers[j], set()).add(mynumbers[i])
keep = []
discard = set()  #a list would work, but I use a set for fast member testing with `in`
for node in mynumbers:
    if node not in discard:  #if node is already in the discard set: skip it
        keep.append(node)  #add node to keep list
        discard.update(neighbors.get(node, set()))  #add node's neighbors to discard set


3.
for j in range(len(mynumbers)):
    fp2.write(' '.join(map(repr, mynumbers[j]))+'\n')



This may be nit-picky of me, but using a range() to iterate over something that is already iterable is something that could be written more concisely and beautifully:



for point in mynumbers:
    #simpler and more concise string building
    #this can also be more easily modified using padding, length, and precision
    # modifiers to make the output file more easily readable (constant width columns)
    fp2.write('{} {} {}\n'.format(*point))
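For instance, with fixed-width columns (the widths and precision here are arbitrary, purely for illustration):

for point in mynumbers:
    # 12-character columns with 2 decimals for x and y, 9 characters for z
    fp2.write('{:12.2f} {:12.2f} {:9.2f}\n'.format(*point))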


























  • Nice answer! I considered KDTree.query_pairs but I couldn't see how to avoid the quadratic worst case: if there are $\Theta(n)$ points all within $r$ of each other, then KDTree.query_pairs finds $\Theta(n^2)$ pairs.
    – Gareth Rees
    Jun 9 at 8:27










  • Great lessons about gc and cKDTree! my respect and thanks!
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 10:20










  • I implemented your code and I got an ValueError by tree = scipy.spatial.CKDTree(mynumbers). The ValueError is : Buffer has wrong number of dimensions (expected 2, got 1).
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 10:40










  • I have found my mistake. The variable "mynumbers" was empty. Thanks again.
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 17:05






  • @MohamadRezaSalehiSadaghiani I would definitely expect my example to become slower as you increase the radius. This goes directly back to Gareth's comment yesterday in that as r goes from 0 → ∞ the time complexity will go from Θ(n) → Θ(n^2)
    – Aaron
    Jun 10 at 20:56










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f196104%2fremoving-neighbors-in-a-point-cloud%23new-answer', 'question_page');

);

Post as a guest






























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
28
down vote



accepted










1. Introduction



It is not a good plan to leave code running for 30 hours. Until you have analyzed the code, and have figured out how its runtime varies according to the size of the input, then you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.



In this situation you need to (i) analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii) measure the performance of the code on small inputs; (iii) use the results of (i) and (ii) to extrapolate to the runtime for the full data set.



2. Review



  1. There are no docstrings. What are these functions supposed to do?



  2. dist can be simplified using math.hypot:



    def dist(p, q):
    "Return the Euclidean distance between points p and q."
    return math.hypot(p[0] - q[0], p[1] - q[1])


  3. The name join_pairs is misleading because it does not join anything. I am sure you could pick a better name.



  4. It would be useful (for testing and performance measurement) to have a function that calls join_pairs in a loop until it returns False. I am going to use this function:



    def sparse_subset1(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    points = points[:]
    while join_pairs(points, r):
    pass
    return points


    The reason that I am taking a copy of points (rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.



3. Performance




  1. Here's a test case of 1,000 random points:



    import random
    POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]


    The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:



    >>> from timeit import timeit
    >>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
    10.972541099996306


    Why does it take so long? The problem is that join_pairs removes just one point each time it is called. This means that you have to call it in a loop, but each time you call it it starts over comparing the same pairs of points that were already compared in the previous iteration. As a result of this algorithm the runtime is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.



    We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best-fit:



    >>> import numpy as np
    >>> n = np.arange(100, 1001, 100)
    >>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
    >>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]


    We can plot this on a log-log graph using matplotlib:



    >>> import matplotlib.pyplot as plt
    >>> plt.xlabel("n")
    >>> plt.ylabel("t (seconds)")
    >>> plt.loglog(n, t, 'r+', label="data")
    >>> plt.loglog(n, a * t**3, label="least squares fit to $t = an^3$")
    >>> plt.legend()
    >>> plt.grid(alpha=0.25)
    >>> plt.show()




    So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:



    >>> a * 300000**3
    339550658.803593


    That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best-fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope for it finish in a few more hours, and so we have to find a better approach.




  2. It would be better to do the work in a single pass over the list of points, like this:



    def sparse_subset2(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    result =
    for p in points:
    if all(dist(p, q) >= r for q in result):
    result.append(p)
    return result


    This builds up the modified point cloud in the list result and returns it (instead of modifying the input points). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.



    To check that this is correct, we can compare the output against the original version of the code:



    >>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
    True


    This revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-points test case, this is hundreds of times faster than the code in the post:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
    0.04830290700192563



  3. To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.



    import rtree

    def sparse_subset3(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    result =
    index = rtree.index.Index()
    for i, p in enumerate(points):
    px, py = p
    nearby = index.intersection((px - r, py - r, px + r, py + r))
    if all(dist(p, points[j]) >= r for j in nearby):
    result.append(p)
    index.insert(i, (px, py, px, py))
    return result


    When this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point p for inclusion in the result set, the code queries a rectangle centred on p with width and height 2 * r. All points within a distance r of p are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant than r. That's why the code has to further check each point in the query rectangle.



    Again, we need to check that this is correct:



    >>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
    True


    An R-tree takes $O(log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of sparse_subset3 to be $O(nlog m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.



    Let's make a much bigger test case, with 100,000 points:



    POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]


    The performance of sparse_subset2 and sparse_subset3 depends on $m$, the number of points in the result. We expect that when r is big, $m$ will be small and so there won't be much difference. Here we use r=0.05 and get a subset of 287 points, and the two algorithms have almost identical runtime:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
    3.5980678769992664
    >>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
    3.598255231976509


    But if we reduce r to 0.01, getting a subset of 5994 points, then the R-tree shines:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
    87.03980600900832
    >>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
    4.6150341169850435


    (I didn't try sparse_subset1 on this test case as it will take many hours.)



    The worst case for sparse_subset3 is when r is zero and so all the points in the input end up in the result:



    >>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
    10.394442071992671


    Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.







share|improve this answer



















  • 5




    It’s not often you get to speed up someone’s program by a million times.
    – Gareth Rees
    Jun 8 at 19:12










  • My one nitpick here would be don't use hypot. It is a function built for numerical stability by using fancy trig. Instead, you only need to minimize distance squared, which op might have done faster.
    – Oscar Smith
    Jun 9 at 0:49










  • @OscarSmith: I tried it just now and I find that calling hypot is about twice as fast as computing the squared distance.
    – Gareth Rees
    Jun 9 at 8:25










  • WOW! it is for me 1000 times better than a lecture. Thanks a lot for your knowledge,wisdom and time.
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 10:19










  • ok, I'm surprised
    – Oscar Smith
    Jun 9 at 14:26














up vote
28
down vote



accepted










1. Introduction



It is not a good plan to leave code running for 30 hours. Until you have analyzed the code, and have figured out how its runtime varies according to the size of the input, then you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.



In this situation you need to (i) analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii) measure the performance of the code on small inputs; (iii) use the results of (i) and (ii) to extrapolate to the runtime for the full data set.



2. Review



  1. There are no docstrings. What are these functions supposed to do?



  2. dist can be simplified using math.hypot:



    def dist(p, q):
    "Return the Euclidean distance between points p and q."
    return math.hypot(p[0] - q[0], p[1] - q[1])


  3. The name join_pairs is misleading because it does not join anything. I am sure you could pick a better name.



  4. It would be useful (for testing and performance measurement) to have a function that calls join_pairs in a loop until it returns False. I am going to use this function:



    def sparse_subset1(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    points = points[:]
    while join_pairs(points, r):
    pass
    return points


    The reason that I am taking a copy of points (rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.



3. Performance




  1. Here's a test case of 1,000 random points:



    import random
    POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]


    The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:



    >>> from timeit import timeit
    >>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
    10.972541099996306


    Why does it take so long? The problem is that join_pairs removes just one point each time it is called. This means that you have to call it in a loop, but each time you call it it starts over comparing the same pairs of points that were already compared in the previous iteration. As a result of this algorithm the runtime is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.



    We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best-fit:



    >>> import numpy as np
    >>> n = np.arange(100, 1001, 100)
    >>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
    >>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]


    We can plot this on a log-log graph using matplotlib:



    >>> import matplotlib.pyplot as plt
    >>> plt.xlabel("n")
    >>> plt.ylabel("t (seconds)")
    >>> plt.loglog(n, t, 'r+', label="data")
    >>> plt.loglog(n, a * t**3, label="least squares fit to $t = an^3$")
    >>> plt.legend()
    >>> plt.grid(alpha=0.25)
    >>> plt.show()




    So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:



    >>> a * 300000**3
    339550658.803593


    That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best-fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope for it finish in a few more hours, and so we have to find a better approach.




  2. It would be better to do the work in a single pass over the list of points, like this:



    def sparse_subset2(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    result =
    for p in points:
    if all(dist(p, q) >= r for q in result):
    result.append(p)
    return result


    This builds up the modified point cloud in the list result and returns it (instead of modifying the input points). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.



    To check that this is correct, we can compare the output against the original version of the code:



    >>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
    True


    This revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-points test case, this is hundreds of times faster than the code in the post:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
    0.04830290700192563



  3. To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.



    import rtree

    def sparse_subset3(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    result =
    index = rtree.index.Index()
    for i, p in enumerate(points):
    px, py = p
    nearby = index.intersection((px - r, py - r, px + r, py + r))
    if all(dist(p, points[j]) >= r for j in nearby):
    result.append(p)
    index.insert(i, (px, py, px, py))
    return result


    When this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point p for inclusion in the result set, the code queries a rectangle centred on p with width and height 2 * r. All points within a distance r of p are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant than r. That's why the code has to further check each point in the query rectangle.



    Again, we need to check that this is correct:



    >>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
    True


    An R-tree takes $O(log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of sparse_subset3 to be $O(nlog m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.



    Let's make a much bigger test case, with 100,000 points:



    POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]


    The performance of sparse_subset2 and sparse_subset3 depends on $m$, the number of points in the result. We expect that when r is big, $m$ will be small and so there won't be much difference. Here we use r=0.05 and get a subset of 287 points, and the two algorithms have almost identical runtime:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
    3.5980678769992664
    >>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
    3.598255231976509


    But if we reduce r to 0.01, getting a subset of 5994 points, then the R-tree shines:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
    87.03980600900832
    >>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
    4.6150341169850435


    (I didn't try sparse_subset1 on this test case as it will take many hours.)



    The worst case for sparse_subset3 is when r is zero and so all the points in the input end up in the result:



    >>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
    10.394442071992671


    Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.







share|improve this answer



















  • 5




    It’s not often you get to speed up someone’s program by a million times.
    – Gareth Rees
    Jun 8 at 19:12










  • My one nitpick here would be don't use hypot. It is a function built for numerical stability by using fancy trig. Instead, you only need to minimize distance squared, which op might have done faster.
    – Oscar Smith
    Jun 9 at 0:49










  • @OscarSmith: I tried it just now and I find that calling hypot is about twice as fast as computing the squared distance.
    – Gareth Rees
    Jun 9 at 8:25










  • WOW! it is for me 1000 times better than a lecture. Thanks a lot for your knowledge,wisdom and time.
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 10:19










  • ok, I'm surprised
    – Oscar Smith
    Jun 9 at 14:26












up vote
28
down vote



accepted







up vote
28
down vote



accepted






1. Introduction



It is not a good plan to leave code running for 30 hours. Until you have analyzed the code, and have figured out how its runtime varies according to the size of the input, then you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.



In this situation you need to (i) analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii) measure the performance of the code on small inputs; (iii) use the results of (i) and (ii) to extrapolate to the runtime for the full data set.



2. Review



  1. There are no docstrings. What are these functions supposed to do?



  2. dist can be simplified using math.hypot:



    def dist(p, q):
    "Return the Euclidean distance between points p and q."
    return math.hypot(p[0] - q[0], p[1] - q[1])


  3. The name join_pairs is misleading because it does not join anything. I am sure you could pick a better name.



  4. It would be useful (for testing and performance measurement) to have a function that calls join_pairs in a loop until it returns False. I am going to use this function:



    def sparse_subset1(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    points = points[:]
    while join_pairs(points, r):
    pass
    return points


    The reason that I am taking a copy of points (rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.



3. Performance




  1. Here's a test case of 1,000 random points:



    import random
    POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]


    The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:



    >>> from timeit import timeit
    >>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
    10.972541099996306


    Why does it take so long? The problem is that join_pairs removes just one point each time it is called. This means that you have to call it in a loop, but each time you call it it starts over comparing the same pairs of points that were already compared in the previous iteration. As a result of this algorithm the runtime is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.



    We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best-fit:



    >>> import numpy as np
    >>> n = np.arange(100, 1001, 100)
    >>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
    >>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]


    We can plot this on a log-log graph using matplotlib:



    >>> import matplotlib.pyplot as plt
    >>> plt.xlabel("n")
    >>> plt.ylabel("t (seconds)")
    >>> plt.loglog(n, t, 'r+', label="data")
    >>> plt.loglog(n, a * t**3, label="least squares fit to $t = an^3$")
    >>> plt.legend()
    >>> plt.grid(alpha=0.25)
    >>> plt.show()




    So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:



    >>> a * 300000**3
    339550658.803593


    That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best-fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope for it finish in a few more hours, and so we have to find a better approach.




  2. It would be better to do the work in a single pass over the list of points, like this:



    def sparse_subset2(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    result =
    for p in points:
    if all(dist(p, q) >= r for q in result):
    result.append(p)
    return result


    This builds up the modified point cloud in the list result and returns it (instead of modifying the input points). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.



    To check that this is correct, we can compare the output against the original version of the code:



    >>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
    True


    This revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-points test case, this is hundreds of times faster than the code in the post:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
    0.04830290700192563



  3. To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.



    import rtree

    def sparse_subset3(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.

    """
    result =
    index = rtree.index.Index()
    for i, p in enumerate(points):
    px, py = p
    nearby = index.intersection((px - r, py - r, px + r, py + r))
    if all(dist(p, points[j]) >= r for j in nearby):
    result.append(p)
    index.insert(i, (px, py, px, py))
    return result


    When this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point p for inclusion in the result set, the code queries a rectangle centred on p with width and height 2 * r. All points within a distance r of p are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant than r. That's why the code has to further check each point in the query rectangle.



    Again, we need to check that this is correct:



    >>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
    True


    An R-tree takes $O(log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of sparse_subset3 to be $O(nlog m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.



    Let's make a much bigger test case, with 100,000 points:



    POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]


    The performance of sparse_subset2 and sparse_subset3 depends on $m$, the number of points in the result. We expect that when r is big, $m$ will be small and so there won't be much difference. Here we use r=0.05 and get a subset of 287 points, and the two algorithms have almost identical runtime:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
    3.5980678769992664
    >>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
    3.598255231976509


    But if we reduce r to 0.01, getting a subset of 5994 points, then the R-tree shines:



    >>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
    87.03980600900832
    >>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
    4.6150341169850435


    (I didn't try sparse_subset1 on this test case as it will take many hours.)



    The worst case for sparse_subset3 is when r is zero and so all the points in the input end up in the result:



    >>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
    10.394442071992671


    Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.








edited Jun 8 at 12:09
answered Jun 8 at 11:18

Gareth Rees

  • It’s not often you get to speed up someone’s program by a million times.
    – Gareth Rees
    Jun 8 at 19:12










  • My one nitpick here would be don't use hypot. It is a function built for numerical stability by using fancy trig. Instead, you only need to minimize distance squared, which op might have done faster.
    – Oscar Smith
    Jun 9 at 0:49










  • @OscarSmith: I tried it just now and I find that calling hypot is about twice as fast as computing the squared distance.
    – Gareth Rees
    Jun 9 at 8:25










  • WOW! It is for me 1000 times better than a lecture. Thanks a lot for your knowledge, wisdom and time.
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 10:19










  • ok, I'm surprised
    – Oscar Smith
    Jun 9 at 14:26
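
    For reference, a minimal sketch of how one might reproduce the timing comparison discussed in the comments above, using only the standard library (the specific points and iteration count are arbitrary, and absolute numbers will vary by machine and Python version):

    from math import hypot
    from timeit import timeit

    p, q = (0.1, 0.2), (0.7, 0.9)

    # time a million evaluations of each expression
    print(timeit(lambda: hypot(p[0] - q[0], p[1] - q[1]), number=10**6))
    print(timeit(lambda: (p[0] - q[0])**2 + (p[1] - q[1])**2, number=10**6))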












up vote
10
down vote













Gareth provided some excellent ideas, so I will suggest a very similar solution, only using scipy instead of rtree. I would also like to point out a few other minutiae I think are important or worth knowing.



1.
import gc



You should never alter the garbage collector unless you absolutely know you need to. Don't use it as a "fix" unless you know that inaccessible circular references (basically its only genuine use case) are your problem. Even then, prefer weak references to break reference loops and let reference counting take care of cleanup rather than the garbage collector (a small weakref sketch follows below). Also, 300k 3-D vectors is not much data for a modern computer, so we can afford to keep a few copies in memory temporarily.
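
For illustration only (this sketch is mine, and the Node class is hypothetical), here is how a weak reference can break a parent/child cycle so that plain reference counting reclaims the objects:

import weakref

class Node:
    def __init__(self, parent=None):
        # the child holds only a weak reference to its parent, so the
        # parent <-> child cycle does not keep the parent alive
        self.parent = weakref.ref(parent) if parent is not None else None
        self.children = []

root = Node()
child = Node(root)
root.children.append(child)
assert child.parent() is root   # dereference the weak reference
del root                        # the parent is reclaimed by reference counting alone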



2.
itertools.combinations(points, 2)



This line, combined with removing only one point per function call, is what gives you the awful time complexity. I would suggest using scipy.spatial.KDTree or cKDTree to accomplish a similar task to rtree. My reasoning is that some very useful clustering and neighbor-matching algorithms are already implemented for you, and in the case of cKDTree they are implemented as a pre-compiled C library, for much better speed than pure Python.



In particular I'd like to use KDTree.query_pairs to generate a set of all point pairs that are closer than a given radius. Once you have that set, you can simply iterate over the pairs and maintain a set of nodes to keep and a set of nodes to remove. This will not, however, preserve exactly the same nodes as your reference implementation: the order of iteration is not guaranteed (the points kept depend on what comes first), and there will be a bias towards a particular direction because the pairs are reported as (i, j) with i < j. Then again, your initial method introduces a bias of its own, since you sort the input list, so this may not be a problem.



Alternatively you could convert the set of pairs into a dictionary of neighbors where each point is a key, with the value being a set of neighboring points. Then you could take your original list of points in the order you desire and use the dictionary to find and eliminate neighboring points. This should give you the same results as your initial implementation, but give much better speed. I have not tested this so you may find you need to cast between tuple and numpy.array in a few places for it to work perfectly...



import scipy.spatial

# convert inner lists to tuples because `set()` doesn't play nice with lists
# (you would probably be better off doing this at creation, while reading the file)
mynumbers = [tuple(point) for point in mynumbers]
tree = scipy.spatial.cKDTree(mynumbers)  # build the k-d tree
pairs = tree.query_pairs(r)              # all index pairs (i, j) closer than radius r
neighbors = {}                           # dictionary mapping each point to its close neighbors
for i, j in pairs:                       # iterate over all close pairs
    neighbors.setdefault(mynumbers[i], set()).add(mynumbers[j])
    neighbors.setdefault(mynumbers[j], set()).add(mynumbers[i])
keep = []
discard = set()  # a list would work, but I use a set for fast member testing with `in`
for node in mynumbers:
    if node not in discard:  # if node is already in the discard set: skip it
        keep.append(node)                           # add node to keep list
        discard.update(neighbors.get(node, set()))  # add node's neighbors to discard set
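
One caveat (my note, not part of the answer above): building the cKDTree on the full 3-tuples measures 3-D distances, whereas dist in the question uses only the first two coordinates. If you want to match that behaviour exactly, an untested adjustment is to build the tree on the x-y coordinates only:

# build the tree on (x, y) only, so pair distances match the 2-D dist in the question
tree = scipy.spatial.cKDTree([point[:2] for point in mynumbers])
pairs = tree.query_pairs(r)  # index pairs still refer to positions in mynumbers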


3.
for j in range(len(mynumbers)):
    fp2.write(' '.join(map(repr, mynumbers[j])) + '\n')



This may be nit-picky of me, but using range() to iterate over something that is already iterable can be written more concisely and beautifully:



for point in mynumbers:
    # simpler and more concise string building
    # this can also be more easily modified using padding, length, and precision
    # modifiers to make the output file more easily readable (constant-width columns)
    fp2.write('{} {} {}\n'.format(*point))
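
For example, a fixed-width variant of that line (the column width and precision here are arbitrary choices, not requirements):

# 12-character columns with 2 decimal places, for constant-width output
fp2.write('{:12.2f} {:12.2f} {:12.2f}\n'.format(*point))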






answered Jun 8 at 21:30

Aaron

  • Nice answer! I considered KDTree.query_pairs but I couldn't see how to avoid the quadratic worst case: if there are $\Theta(n)$ points all within $r$ of each other, then KDTree.query_pairs finds $\Theta(n^2)$ pairs.
    – Gareth Rees
    Jun 9 at 8:27










  • Great lessons about gc and cKDTree! my respect and thanks!
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 10:20










  • I implemented your code and I got a ValueError from tree = scipy.spatial.CKDTree(mynumbers). The ValueError is: Buffer has wrong number of dimensions (expected 2, got 1).
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 10:40










  • I have found my mistake. The variable "mynumbers" was empty. Thanks again.
    – Mohamad Reza Salehi Sadaghiani
    Jun 9 at 17:05






  • @MohamadRezaSalehiSadaghiani I would definitely expect my example to become slower as you increase the radius. This goes directly back to Gareth's comment yesterday in that as r goes from 0 → ∞ the time complexity will go from Θ(n) → Θ(n^2).
    – Aaron
    Jun 10 at 20:56













