Removing neighbors in a point cloud
I have written a program to thin out a point cloud based on the distances between the points. The code works very well for a small number of points: for 1700 points it takes about 6 minutes. But I have 300000 points in the point cloud, and the code is still running after almost 30 hours.
I am sure there is a Pythonic way to optimize the code, but I do not know how to reduce the calculation time and improve the performance. I have read a lot about multithreading, chunk size, etc., but I think there is no need for this for data of this size. I think I am using memory in a very bad way. Am I right? Here is the code:
import pickle
import math
import itertools
import gc

def dist(p0, p1):
    return math.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2)

def join_pair(points, r):
    for p, q in itertools.combinations(points, 2):
        if dist(p, q) < r:
            points.remove(q)
            return True
    return False

if __name__ == "__main__":
    filename = 'points.txt'
    filename2 = 'points_modified.txt'
    mynumbers = []
    with open(filename) as f:
        for line in f:
            mynumbers.append([float(n) for n in line.strip().split(' ')])
    mynumbers = sorted(mynumbers, key=lambda tup: (tup[1], tup[0]))
    gc.collect()
    while join_pair(mynumbers, 3.6):
        pass
    with open(filename2, "w") as fp2:
        for j in range(len(mynumbers)):
            fp2.write(' '.join(map(repr, mynumbers[j])) + '\n')
and the input data looks like this:
661234.58 5763766.03 72.63
661254.81 5763765.08 75.04
661244.86 5763764.14 74.99
661234.90 5763763.21 74.94
661225.13 5763762.29 74.89
python algorithm time-limit-exceeded clustering
asked Jun 8 at 9:30, edited Jun 9 at 14:42 – Mohamad Reza Salehi Sadaghiani
can you give a few lines of points.txt as sample data? – Maarten Fabré, Jun 8 at 9:36

can you use external libraries such as numpy or pandas? – Maarten Fabré, Jun 8 at 9:36

a bit more explanation on what join_pair needs to do would be welcome too – Maarten Fabré, Jun 8 at 9:37

A sample of my points has been added. Yes, I can use numpy. join_pair takes the points, calculates the distance (via the dist function) and then compares it with the limit of 3.6 meters. If the distance is less than the limit, the point is deleted from the final output. At the end I get a point cloud whose points are all more than 3.6 meters apart. – Mohamad Reza Salehi Sadaghiani, Jun 8 at 9:53

"there is no need for this size of dat" - is the last word supposed to be "data" or "that"? – mkrieger1, Jun 9 at 7:58
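As an aside prompted by the comments above (a minimal sketch, not from any of the posts, assuming the whitespace-separated x y z layout shown in the question): since NumPy is allowed, the input file could be loaded in a single call.

import numpy as np

points = np.loadtxt('points.txt')   # array of shape (n, 3): x, y, z
xy = points[:, :2]                  # only x and y are used for the 2D distances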
2 Answers
Accepted answer (28 votes) – Gareth Rees, answered Jun 8 at 11:18, edited Jun 8 at 12:09
1. Introduction
It is not a good plan to leave code running for 30 hours. Until you have analyzed the code and figured out how its runtime varies according to the size of the input, you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.
In this situation you need to (i) analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii) measure the performance of the code on small inputs; (iii) use the results of (i) and (ii) to extrapolate to the runtime for the full data set.
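As a minimal sketch of steps (ii) and (iii) (not from the original answer; run and data are hypothetical placeholders for the function under test and the full input), one could estimate the growth exponent directly from a log-log fit:

from timeit import timeit
import numpy as np

sizes = [250, 500, 1000, 2000]
times = [timeit(lambda: run(data[:n]), number=1) for n in sizes]

# fit log t = k*log n + c; the slope k estimates the exponent of the runtime growth
k, c = np.polyfit(np.log(sizes), np.log(times), 1)
estimate_full = np.exp(c) * len(data) ** k   # extrapolated runtime for the full input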
2. Review
There are no docstrings. What are these functions supposed to do?

dist can be simplified using math.hypot:

def dist(p, q):
    "Return the Euclidean distance between points p and q."
    return math.hypot(p[0] - q[0], p[1] - q[1])

The name join_pairs is misleading because it does not join anything. I am sure you could pick a better name.

It would be useful (for testing and performance measurement) to have a function that calls join_pairs in a loop until it returns False. I am going to use this function:

def sparse_subset1(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.
    """
    points = points[:]
    while join_pairs(points, r):
        pass
    return points

The reason that I am taking a copy of points (rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.
3. Performance
Here's a test case of 1,000 random points:

import random
POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]

The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:

>>> from timeit import timeit
>>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
10.972541099996306

Why does it take so long? The problem is that join_pairs removes just one point each time it is called. This means that you have to call it in a loop, but each time you call it, it starts over comparing the same pairs of points that were already compared in the previous iteration. As a result of this algorithm the runtime is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.

We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best fit:

>>> import numpy as np
>>> n = np.arange(100, 1001, 100)
>>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
>>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]

We can plot this on a log-log graph using matplotlib:

>>> import matplotlib.pyplot as plt
>>> plt.xlabel("n")
>>> plt.ylabel("t (seconds)")
>>> plt.loglog(n, t, 'r+', label="data")
>>> plt.loglog(n, a * n**3, label="least squares fit to $t = an^3$")
>>> plt.legend()
>>> plt.grid(alpha=0.25)
>>> plt.show()

So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:

>>> a * 300000**3
339550658.803593

That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope of it finishing in a few more hours, and so we have to find a better approach.
It would be better to do the work in a single pass over the list of points, like this:

def sparse_subset2(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.
    """
    result = []
    for p in points:
        if all(dist(p, q) >= r for q in result):
            result.append(p)
    return result

This builds up the modified point cloud in the list result and returns it (instead of modifying the input points). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.

To check that this is correct, we can compare the output against the original version of the code:

>>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
True

This revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-points test case, this is hundreds of times faster than the code in the post:

>>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
0.04830290700192563

To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.
import rtree

def sparse_subset3(points, r):
    """Return a maximal list of elements of points such that no pairs of
    points in the result have distance less than r.
    """
    result = []
    index = rtree.index.Index()
    for i, p in enumerate(points):
        px, py = p
        nearby = index.intersection((px - r, py - r, px + r, py + r))
        if all(dist(p, points[j]) >= r for j in nearby):
            result.append(p)
            index.insert(i, (px, py, px, py))
    return result

When this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point p for inclusion in the result set, the code queries a rectangle centred on p with width and height 2 * r. All points within a distance r of p are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant than r. That's why the code has to further check each point in the query rectangle.

Again, we need to check that this is correct:

>>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
True

An R-tree takes $O(\log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of sparse_subset3 to be $O(n \log m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.

Let's make a much bigger test case, with 100,000 points:

POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]
The performance of sparse_subset2 and sparse_subset3 depends on $m$, the number of points in the result. We expect that when r is big, $m$ will be small and so there won't be much difference. Here we use r=0.05 and get a subset of 287 points, and the two algorithms have almost identical runtime:

>>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
3.5980678769992664
>>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
3.598255231976509

But if we reduce r to 0.01, getting a subset of 5994 points, then the R-tree shines:

>>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
87.03980600900832
>>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
4.6150341169850435

(I didn't try sparse_subset1 on this test case as it will take many hours.)

The worst case for sparse_subset3 is when r is zero and so all the points in the input end up in the result:

>>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
10.394442071992671

Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.
It's not often you get to speed up someone's program by a million times. – Gareth Rees, Jun 8 at 19:12

My one nitpick here would be: don't use hypot. It is a function built for numerical stability by using fancy trig. Instead, you only need to minimize distance squared, which the OP might have done faster. – Oscar Smith, Jun 9 at 0:49

@OscarSmith: I tried it just now and I find that calling hypot is about twice as fast as computing the squared distance. – Gareth Rees, Jun 9 at 8:25

WOW! It is for me 1000 times better than a lecture. Thanks a lot for your knowledge, wisdom and time. – Mohamad Reza Salehi Sadaghiani, Jun 9 at 10:19

ok, I'm surprised – Oscar Smith, Jun 9 at 14:26
Answer (10 votes) – Aaron
Gareth provided some excellent ideas, so I will suggest a very similar solution, only using scipy instead of rtree. I would also like to point out a few other minutiae I think are important or worth knowing.
1. import gc

You should never alter the garbage collection unless you absolutely know you need to. Don't use it as a "fix" unless you absolutely know that inaccessible circular references (basically the only genuine use case) are your problem. Even then, prefer to use weak references to break reference loops and let reference counting take care of cleanup rather than garbage collection. Also, 300k 3D vectors is not that much data for a modern computer, so we can afford to keep a few copies in memory temporarily.
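As a minimal sketch of the weak-reference idea (not from the original answer; the Node class here is purely hypothetical), a parent/child cycle can be broken by pointing back up the structure with a weak reference:

import weakref

class Node:
    def __init__(self, value):
        self.value = value
        self.parent = None        # will hold a weak reference, not a strong one
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        # the child refers back to the parent only weakly, so dropping the
        # parent lets reference counting reclaim the whole structure without
        # relying on the cycle-detecting garbage collector
        child.parent = weakref.ref(self)

root = Node(0)
leaf = Node(1)
root.add_child(leaf)
parent = leaf.parent()            # call the weakref to get the parent (or None)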
2. itertools.combinations(points, 2)

This line, combined with only removing one point per function call, is what gives you the awful time complexity. I would suggest using scipy.spatial.KDTree or cKDTree to accomplish a similar task to rtree. My reasoning for this is that some very useful clustering and neighbour-matching algorithms are already implemented for you, and in the case of cKDTree, they are implemented as a pre-compiled C library for much better speed than pure Python.
In particular I'd like to use KDTree.query_pairs to generate a set of all point pairs that are closer than a given radius. Once you have that set, you can simply iterate over the pairs and maintain a set of nodes to keep and a set of nodes to remove. This will not, however, preserve exactly the same nodes as your reference implementation, as the order of iteration is not guaranteed (which points are kept will depend on what comes first), and there will be a bias towards a particular direction because the pairs are reported as (i, j) with i < j. That's not to say that your initial method introduces no bias as well, as you sort the input list, so this may not be a problem.
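A minimal sketch of that direct pairs-iteration approach (not from the original answer; the function name and the greedy tie-breaking order are my own assumptions) might look like this:

import scipy.spatial

def thin_by_pairs(points, r):
    # hypothetical helper: greedily discard the second point of each close pair
    tree = scipy.spatial.cKDTree(points)
    discard = set()
    for i, j in sorted(tree.query_pairs(r)):   # index pairs with i < j
        if i not in discard:                   # i is being kept, so j must go
            discard.add(j)
    return [p for k, p in enumerate(points) if k not in discard]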
Alternatively you could convert the set of pairs into a dictionary of neighbors where each point is a key, with the value being a set of neighboring points. Then you could take your original list of points in the order you desire and use the dictionary to find and eliminate neighboring points. This should give you the same results as your initial implementation, but give much better speed. I have not tested this, so you may find you need to cast between tuple and numpy.array in a few places for it to work perfectly...
import scipy.spatial

# convert inner lists to tuples because `set()` doesn't play nice with lists
# (you would probably be better off doing this at creation while reading the file)
mynumbers = [tuple(point) for point in mynumbers]
tree = scipy.spatial.cKDTree(mynumbers)  # build k-dimensional tree
pairs = tree.query_pairs(r)              # find all pairs closer than radius: r
neighbors = {}                           # dictionary of neighbors
for i, j in pairs:                       # iterate over all pairs
    neighbors.setdefault(i, set()).add(j)
    neighbors.setdefault(j, set()).add(i)
keep = []
discard = set()  # a list would work, but I use a set for fast member testing with `in`
for i in range(len(mynumbers)):          # iterate over point indices in the original order
    if i not in discard:                 # if this point is already discarded: skip it
        keep.append(mynumbers[i])        # keep this point
        discard.update(neighbors.get(i, set()))  # and discard all of its close neighbors
3. for j in range(len(mynumbers)): ...

for j in range(len(mynumbers)):
    fp2.write(' '.join(map(repr, mynumbers[j])) + '\n')

This may be nit-picky of me, but using range() to iterate over something that is already iterable is something that could be written more concisely and beautifully:

for point in mynumbers:
    # simpler and more concise string building;
    # this can also be more easily modified using padding, length, and precision
    # modifiers to make the output file more easily readable (constant-width columns)
    fp2.write('{} {} {}\n'.format(*point))
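As a small illustration of those padding and precision modifiers (not from the original answer; the column widths chosen here are arbitrary), the same loop could produce constant-width columns like this:

with open(filename2, "w") as fp2:
    for point in mynumbers:
        # 12-character-wide, 2-decimal columns keep x, y and z aligned
        fp2.write('{:12.2f} {:12.2f} {:12.2f}\n'.format(*point))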
Nice answer! I considered KDTree.query_pairs but I couldn't see how to avoid the quadratic worst case: if there are $\Theta(n)$ points all within $r$ of each other, then KDTree.query_pairs finds $\Theta(n^2)$ pairs. – Gareth Rees, Jun 9 at 8:27

Great lessons about gc and cKDTree! My respect and thanks! – Mohamad Reza Salehi Sadaghiani, Jun 9 at 10:20

I implemented your code and I got a ValueError from tree = scipy.spatial.cKDTree(mynumbers). The ValueError is: Buffer has wrong number of dimensions (expected 2, got 1). – Mohamad Reza Salehi Sadaghiani, Jun 9 at 10:40

I have found my mistake. The variable "mynumbers" was empty. Thanks again. – Mohamad Reza Salehi Sadaghiani, Jun 9 at 17:05

@MohamadRezaSalehiSadaghiani I would definitely expect my example to become slower as you increase the radius. This goes directly back to Gareth's comment yesterday: as r goes from 0 to infinity, the time complexity will go from $\Theta(n)$ to $\Theta(n^2)$. – Aaron, Jun 10 at 20:56
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
28
down vote
accepted
1. Introduction
It is not a good plan to leave code running for 30 hours. Until you have analyzed the code, and have figured out how its runtime varies according to the size of the input, then you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.
In this situation you need to (i)ÃÂ analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii)ÃÂ measure the performance of the code on small inputs; (iii)ÃÂ use the results of (i) and (ii) to extrapolate to the runtime for the full data set.
2. Review
There are no docstrings. What are these functions supposed to do?
dist
can be simplified usingmath.hypot
:def dist(p, q):
"Return the Euclidean distance between points p and q."
return math.hypot(p[0] - q[0], p[1] - q[1])The name
join_pairs
is misleading because it does not join anything. I am sure you could pick a better name.It would be useful (for testing and performance measurement) to have a function that calls
join_pairs
in a loop until it returnsFalse
. I am going to use this function:def sparse_subset1(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
points = points[:]
while join_pairs(points, r):
pass
return pointsThe reason that I am taking a copy of
points
(rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.
3. Performance
Here's a test case of 1,000 random points:
import random
POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:
>>> from timeit import timeit
>>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
10.972541099996306Why does it take so long? The problem is that
join_pairs
removes just one point each time it is called. This means that you have to call it in a loop, but each time you call it it starts over comparing the same pairs of points that were already compared in the previous iteration. As a result of this algorithm the runtime is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best-fit:
>>> import numpy as np
>>> n = np.arange(100, 1001, 100)
>>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
>>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]We can plot this on a log-log graph using matplotlib:
>>> import matplotlib.pyplot as plt
>>> plt.xlabel("n")
>>> plt.ylabel("t (seconds)")
>>> plt.loglog(n, t, 'r+', label="data")
>>> plt.loglog(n, a * t**3, label="least squares fit to $t = an^3$")
>>> plt.legend()
>>> plt.grid(alpha=0.25)
>>> plt.show()So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:
>>> a * 300000**3
339550658.803593That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best-fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope for it finish in a few more hours, and so we have to find a better approach.
It would be better to do the work in a single pass over the list of points, like this:
def sparse_subset2(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
result =
for p in points:
if all(dist(p, q) >= r for q in result):
result.append(p)
return resultThis builds up the modified point cloud in the list
result
and returns it (instead of modifying the inputpoints
). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.To check that this is correct, we can compare the output against the original version of the code:
>>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
TrueThis revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-points test case, this is hundreds of times faster than the code in the post:
>>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
0.04830290700192563To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.
import rtree
def sparse_subset3(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
result =
index = rtree.index.Index()
for i, p in enumerate(points):
px, py = p
nearby = index.intersection((px - r, py - r, px + r, py + r))
if all(dist(p, points[j]) >= r for j in nearby):
result.append(p)
index.insert(i, (px, py, px, py))
return resultWhen this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point
p
for inclusion in the result set, the code queries a rectangle centred onp
with width and height2 * r
. All points within a distancer
ofp
are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant thanr
. That's why the code has to further check each point in the query rectangle.Again, we need to check that this is correct:
>>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
TrueAn R-tree takes $O(log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of
sparse_subset3
to be $O(nlog m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.Let's make a much bigger test case, with 100,000 points:
POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]
The performance of
sparse_subset2
andsparse_subset3
depends on $m$, the number of points in the result. We expect that whenr
is big, $m$ will be small and so there won't be much difference. Here we user=0.05
and get a subset of 287 points, and the two algorithms have almost identical runtime:>>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
3.5980678769992664
>>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
3.598255231976509But if we reduce
r
to0.01
, getting a subset of 5994 points, then the R-tree shines:>>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
87.03980600900832
>>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
4.6150341169850435(I didn't try
sparse_subset1
on this test case as it will take many hours.)The worst case for
sparse_subset3
is whenr
is zero and so all the points in the input end up in the result:>>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
10.394442071992671Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.
5
ItâÂÂs not often you get to speed up someoneâÂÂs program by a million times.
â Gareth Rees
Jun 8 at 19:12
My one nitpick here would be don't usehypot
. It is a function built for numerical stability by using fancy trig. Instead, you only need to minimize distance squared, which op might have done faster.
â Oscar Smith
Jun 9 at 0:49
@OscarSmith: I tried it just now and I find that callinghypot
is about twice as fast as computing the squared distance.
â Gareth Rees
Jun 9 at 8:25
WOW! it is for me 1000 times better than a lecture. Thanks a lot for your knowledge,wisdom and time.
â Mohamad Reza Salehi Sadaghiani
Jun 9 at 10:19
ok, I'm surprised
â Oscar Smith
Jun 9 at 14:26
 |Â
show 2 more comments
up vote
28
down vote
accepted
1. Introduction
It is not a good plan to leave code running for 30 hours. Until you have analyzed the code, and have figured out how its runtime varies according to the size of the input, then you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.
In this situation you need to (i)ÃÂ analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii)ÃÂ measure the performance of the code on small inputs; (iii)ÃÂ use the results of (i) and (ii) to extrapolate to the runtime for the full data set.
2. Review
There are no docstrings. What are these functions supposed to do?
dist
can be simplified usingmath.hypot
:def dist(p, q):
"Return the Euclidean distance between points p and q."
return math.hypot(p[0] - q[0], p[1] - q[1])The name
join_pairs
is misleading because it does not join anything. I am sure you could pick a better name.It would be useful (for testing and performance measurement) to have a function that calls
join_pairs
in a loop until it returnsFalse
. I am going to use this function:def sparse_subset1(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
points = points[:]
while join_pairs(points, r):
pass
return pointsThe reason that I am taking a copy of
points
(rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.
3. Performance
Here's a test case of 1,000 random points:
import random
POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:
>>> from timeit import timeit
>>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
10.972541099996306Why does it take so long? The problem is that
join_pairs
removes just one point each time it is called. This means that you have to call it in a loop, but each time you call it it starts over comparing the same pairs of points that were already compared in the previous iteration. As a result of this algorithm the runtime is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best-fit:
>>> import numpy as np
>>> n = np.arange(100, 1001, 100)
>>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
>>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]We can plot this on a log-log graph using matplotlib:
>>> import matplotlib.pyplot as plt
>>> plt.xlabel("n")
>>> plt.ylabel("t (seconds)")
>>> plt.loglog(n, t, 'r+', label="data")
>>> plt.loglog(n, a * t**3, label="least squares fit to $t = an^3$")
>>> plt.legend()
>>> plt.grid(alpha=0.25)
>>> plt.show()So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:
>>> a * 300000**3
339550658.803593That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best-fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope for it finish in a few more hours, and so we have to find a better approach.
It would be better to do the work in a single pass over the list of points, like this:
def sparse_subset2(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
result =
for p in points:
if all(dist(p, q) >= r for q in result):
result.append(p)
return resultThis builds up the modified point cloud in the list
result
and returns it (instead of modifying the inputpoints
). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.To check that this is correct, we can compare the output against the original version of the code:
>>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
TrueThis revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-points test case, this is hundreds of times faster than the code in the post:
>>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
0.04830290700192563To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.
import rtree
def sparse_subset3(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
result =
index = rtree.index.Index()
for i, p in enumerate(points):
px, py = p
nearby = index.intersection((px - r, py - r, px + r, py + r))
if all(dist(p, points[j]) >= r for j in nearby):
result.append(p)
index.insert(i, (px, py, px, py))
return resultWhen this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point
p
for inclusion in the result set, the code queries a rectangle centred onp
with width and height2 * r
. All points within a distancer
ofp
are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant thanr
. That's why the code has to further check each point in the query rectangle.Again, we need to check that this is correct:
>>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
TrueAn R-tree takes $O(log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of
sparse_subset3
to be $O(nlog m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.Let's make a much bigger test case, with 100,000 points:
POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]
The performance of
sparse_subset2
andsparse_subset3
depends on $m$, the number of points in the result. We expect that whenr
is big, $m$ will be small and so there won't be much difference. Here we user=0.05
and get a subset of 287 points, and the two algorithms have almost identical runtime:>>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
3.5980678769992664
>>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
3.598255231976509But if we reduce
r
to0.01
, getting a subset of 5994 points, then the R-tree shines:>>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
87.03980600900832
>>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
4.6150341169850435(I didn't try
sparse_subset1
on this test case as it will take many hours.)The worst case for
sparse_subset3
is whenr
is zero and so all the points in the input end up in the result:>>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
10.394442071992671Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.
5
ItâÂÂs not often you get to speed up someoneâÂÂs program by a million times.
â Gareth Rees
Jun 8 at 19:12
My one nitpick here would be don't usehypot
. It is a function built for numerical stability by using fancy trig. Instead, you only need to minimize distance squared, which op might have done faster.
â Oscar Smith
Jun 9 at 0:49
@OscarSmith: I tried it just now and I find that callinghypot
is about twice as fast as computing the squared distance.
â Gareth Rees
Jun 9 at 8:25
WOW! it is for me 1000 times better than a lecture. Thanks a lot for your knowledge,wisdom and time.
â Mohamad Reza Salehi Sadaghiani
Jun 9 at 10:19
ok, I'm surprised
â Oscar Smith
Jun 9 at 14:26
 |Â
show 2 more comments
up vote
28
down vote
accepted
up vote
28
down vote
accepted
1. Introduction
It is not a good plan to leave code running for 30 hours. Until you have analyzed the code, and have figured out how its runtime varies according to the size of the input, then you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.
In this situation you need to (i)ÃÂ analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii)ÃÂ measure the performance of the code on small inputs; (iii)ÃÂ use the results of (i) and (ii) to extrapolate to the runtime for the full data set.
2. Review
There are no docstrings. What are these functions supposed to do?
dist
can be simplified usingmath.hypot
:def dist(p, q):
"Return the Euclidean distance between points p and q."
return math.hypot(p[0] - q[0], p[1] - q[1])The name
join_pairs
is misleading because it does not join anything. I am sure you could pick a better name.It would be useful (for testing and performance measurement) to have a function that calls
join_pairs
in a loop until it returnsFalse
. I am going to use this function:def sparse_subset1(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
points = points[:]
while join_pairs(points, r):
pass
return pointsThe reason that I am taking a copy of
points
(rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.
3. Performance
Here's a test case of 1,000 random points:
import random
POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:
>>> from timeit import timeit
>>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
10.972541099996306Why does it take so long? The problem is that
join_pairs
removes just one point each time it is called. This means that you have to call it in a loop, but each time you call it it starts over comparing the same pairs of points that were already compared in the previous iteration. As a result of this algorithm the runtime is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best-fit:
>>> import numpy as np
>>> n = np.arange(100, 1001, 100)
>>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
>>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]We can plot this on a log-log graph using matplotlib:
>>> import matplotlib.pyplot as plt
>>> plt.xlabel("n")
>>> plt.ylabel("t (seconds)")
>>> plt.loglog(n, t, 'r+', label="data")
>>> plt.loglog(n, a * t**3, label="least squares fit to $t = an^3$")
>>> plt.legend()
>>> plt.grid(alpha=0.25)
>>> plt.show()So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:
>>> a * 300000**3
339550658.803593That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best-fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope for it finish in a few more hours, and so we have to find a better approach.
It would be better to do the work in a single pass over the list of points, like this:
def sparse_subset2(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
result =
for p in points:
if all(dist(p, q) >= r for q in result):
result.append(p)
return resultThis builds up the modified point cloud in the list
result
and returns it (instead of modifying the inputpoints
). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.To check that this is correct, we can compare the output against the original version of the code:
>>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
TrueThis revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-points test case, this is hundreds of times faster than the code in the post:
>>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
0.04830290700192563To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.
import rtree
def sparse_subset3(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
result =
index = rtree.index.Index()
for i, p in enumerate(points):
px, py = p
nearby = index.intersection((px - r, py - r, px + r, py + r))
if all(dist(p, points[j]) >= r for j in nearby):
result.append(p)
index.insert(i, (px, py, px, py))
return resultWhen this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point
p
for inclusion in the result set, the code queries a rectangle centred onp
with width and height2 * r
. All points within a distancer
ofp
are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant thanr
. That's why the code has to further check each point in the query rectangle.Again, we need to check that this is correct:
>>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
TrueAn R-tree takes $O(log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of
sparse_subset3
to be $O(nlog m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.Let's make a much bigger test case, with 100,000 points:
POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]
The performance of
sparse_subset2
andsparse_subset3
depends on $m$, the number of points in the result. We expect that whenr
is big, $m$ will be small and so there won't be much difference. Here we user=0.05
and get a subset of 287 points, and the two algorithms have almost identical runtime:>>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
3.5980678769992664
>>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
3.598255231976509But if we reduce
r
to0.01
, getting a subset of 5994 points, then the R-tree shines:>>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
87.03980600900832
>>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
4.6150341169850435(I didn't try
sparse_subset1
on this test case as it will take many hours.)The worst case for
sparse_subset3
is whenr
is zero and so all the points in the input end up in the result:>>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
10.394442071992671Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.
1. Introduction
It is not a good plan to leave code running for 30 hours. Until you have analyzed the code, and have figured out how its runtime varies according to the size of the input, then you have no idea how long it will take. Maybe it will take 31 hours, but maybe it will take a million hours.
In this situation you need to (i)ÃÂ analyze the algorithm to get an expression for its runtime complexity (that is, how its runtime depends on the size of the input); (ii)ÃÂ measure the performance of the code on small inputs; (iii)ÃÂ use the results of (i) and (ii) to extrapolate to the runtime for the full data set.
2. Review
There are no docstrings. What are these functions supposed to do?
dist
can be simplified usingmath.hypot
:def dist(p, q):
"Return the Euclidean distance between points p and q."
return math.hypot(p[0] - q[0], p[1] - q[1])The name
join_pairs
is misleading because it does not join anything. I am sure you could pick a better name.It would be useful (for testing and performance measurement) to have a function that calls
join_pairs
in a loop until it returnsFalse
. I am going to use this function:def sparse_subset1(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
points = points[:]
while join_pairs(points, r):
pass
return pointsThe reason that I am taking a copy of
points
(rather than modifying the original) is so that I can call this function for testing or performance measurement without destroying the input data. See below for some examples.
3. Performance
Here's a test case of 1,000 random points:
import random
POINTS = [[random.random() for _ in range(2)] for _ in range(1000)]The code in the post takes about 11 seconds to reduce this to a sparse subset of 286 points at least 0.04 apart:
>>> from timeit import timeit
>>> timeit(lambda:sparse_subset1(POINTS, 0.04), number=1)
10.972541099996306Why does it take so long? The problem is that
join_pairs
removes just one point each time it is called. This means that you have to call it in a loop, but each time you call it it starts over comparing the same pairs of points that were already compared in the previous iteration. As a result of this algorithm the runtime is $O(n^3)$: that is, the runtime grows like the cube of the length of the input.We can see this cubic behaviour by measuring the time $t$ taken by the function on a range of sizes $n$ and finding a cubic equation $t = an^3$ of best-fit:
>>> import numpy as np
>>> n = np.arange(100, 1001, 100)
>>> t = [timeit(lambda:sparse_subset1(POINTS[:i], 0.04), number=1) for i in n]
>>> a, = np.linalg.lstsq(n[:, np.newaxis] ** 3, t)[0]We can plot this on a log-log graph using matplotlib:
>>> import matplotlib.pyplot as plt
>>> plt.xlabel("n")
>>> plt.ylabel("t (seconds)")
>>> plt.loglog(n, t, 'r+', label="data")
>>> plt.loglog(n, a * t**3, label="least squares fit to $t = an^3$")
>>> plt.legend()
>>> plt.grid(alpha=0.25)
>>> plt.show()So extrapolating from this analysis and measurement, we estimate that if $n=300000$ then the time taken will be roughly:
>>> a * 300000**3
339550658.803593That's about ten years! Looking at the graph, it seems that the slope of the data is not quite as steep as the line of best-fit, so maybe the true exponent is a bit less than 3 and the extrapolated time somewhat less than ten years. But as a rough analysis this tells us that there's no hope for it finish in a few more hours, and so we have to find a better approach.
It would be better to do the work in a single pass over the list of points, like this:
def sparse_subset2(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
result =
for p in points:
if all(dist(p, q) >= r for q in result):
result.append(p)
return resultThis builds up the modified point cloud in the list
result
and returns it (instead of modifying the inputpoints
). The code considers each point in the input, and adds it to the result so long as it is sufficiently far from all the points that have been previously added to the result.To check that this is correct, we can compare the output against the original version of the code:
>>> sparse_subset1(POINTS, 0.04) == sparse_subset2(POINTS, 0.04)
TrueThis revised code takes $O(nm)$, where $n$ is the number of points in the input and $m$ is the number of points in the output. In the worst case $m = n$ and the code has to compare every pair of points. On the 1000-points test case, this is hundreds of times faster than the code in the post:
>>> timeit(lambda:sparse_subset2(POINTS, 0.04), number=1)
0.04830290700192563To further improve the performance, we need a spatial index, that is, a data structure supporting efficient nearest-neighbour queries. For this problem, I'm going to use the Rtree package from the Python Package Index, which implements the R-tree data structure. An R-tree maintains a collection of rectangles (or hyper-rectangles if the search space has more than two dimensions), and can efficiently find the rectangles in the collection that intersect with a query rectangle.
import rtree
def sparse_subset3(points, r):
"""Return a maximal list of elements of points such that no pairs of
points in the result have distance less than r.
"""
result =
index = rtree.index.Index()
for i, p in enumerate(points):
px, py = p
nearby = index.intersection((px - r, py - r, px + r, py + r))
if all(dist(p, points[j]) >= r for j in nearby):
result.append(p)
index.insert(i, (px, py, px, py))
return resultWhen this code appends a point to the result list, it also adds a corresponding rectangle to the R-tree. This rectangle has zero width and height, so it represents a single point. When considering a point
p
for inclusion in the result set, the code queries a rectangle centred onp
with width and height2 * r
. All points within a distancer
ofp
are inside this rectangle, but there might also be some points in the corners of the rectangle that are more distant thanr
. That's why the code has to further check each point in the query rectangle.Again, we need to check that this is correct:
>>> sparse_subset1(POINTS, 0.04) == sparse_subset3(POINTS, 0.04)
TrueAn R-tree takes $O(log m)$ to find an intersecting rectangle in an index of $m$ rectangles, so we expect the runtime of
sparse_subset3
to be $O(nlog m)$, where $n$ is the number of points in the input and $m$ is the number of points in the output.Let's make a much bigger test case, with 100,000 points:
POINTS = [[random.random() for _ in range(2)] for _ in range(10**5)]
The performance of
sparse_subset2
andsparse_subset3
depends on $m$, the number of points in the result. We expect that whenr
is big, $m$ will be small and so there won't be much difference. Here we user=0.05
and get a subset of 287 points, and the two algorithms have almost identical runtime:>>> timeit(lambda:sparse_subset2(POINTS, 0.05), number=1)
3.5980678769992664
>>> timeit(lambda:sparse_subset3(POINTS, 0.05), number=1)
3.598255231976509But if we reduce
r
to0.01
, getting a subset of 5994 points, then the R-tree shines:>>> timeit(lambda:sparse_subset2(POINTS, 0.01), number=1)
87.03980600900832
>>> timeit(lambda:sparse_subset3(POINTS, 0.01), number=1)
4.6150341169850435(I didn't try
sparse_subset1
on this test case as it will take many hours.)The worst case for
sparse_subset3
is whenr
is zero and so all the points in the input end up in the result:>>> timeit(lambda:sparse_subset3(POINTS, 0), number=1)
10.394442071992671Extrapolating from this, you should be able to run your 300,000-point case in under a minute, which is a big saving on ten years.
edited Jun 8 at 12:09
answered Jun 8 at 11:18
Gareth Rees
41.1k394166
41.1k394166
5
ItâÂÂs not often you get to speed up someoneâÂÂs program by a million times.
â Gareth Rees
Jun 8 at 19:12
My one nitpick here would be don't usehypot
. It is a function built for numerical stability by using fancy trig. Instead, you only need to minimize distance squared, which op might have done faster.
â Oscar Smith
Jun 9 at 0:49
@OscarSmith: I tried it just now and I find that callinghypot
is about twice as fast as computing the squared distance.
â Gareth Rees
Jun 9 at 8:25
WOW! it is for me 1000 times better than a lecture. Thanks a lot for your knowledge,wisdom and time.
â Mohamad Reza Salehi Sadaghiani
Jun 9 at 10:19
ok, I'm surprised
â Oscar Smith
Jun 9 at 14:26
 |Â
up vote
10
down vote
Gareth provided some excellent ideas, so I will suggest a very similar solution, only using scipy instead of rtree. I also would like to point out a few other minutiae I think are important or worth knowing.

1. import gc

You should never alter garbage collection unless you absolutely know you need to. Don't use it as a "fix" unless you absolutely know that inaccessible circular references (basically the only genuine use case) are your problem. Even then, prefer to use weak references to break reference loops and let reference counting take care of cleanup rather than garbage collection. Also, 300k 3-d vectors is not much data to a modern computer, so we can afford to keep a few copies in memory temporarily.
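For instance, a minimal sketch of breaking a parent/child reference cycle with a weak reference (the Node class here is hypothetical, just to illustrate the idea):

import weakref

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.parent = None                 # will hold a weak reference, not a strong one

    def add_child(self, child):
        self.children.append(child)        # strong reference down the tree
        child.parent = weakref.ref(self)   # weak reference up the tree, so no cycle

root = Node("root")
root.add_child(Node("leaf"))
leaf = root.children[0]
assert leaf.parent() is root               # call the weakref to dereference it (None once collected)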
2. itertools.combinations(points, 2)

This line, combined with only removing one point per function call, is what gives you awful time complexity. I would suggest using scipy.spatial.KDTree or cKDTree to accomplish a similar task to rtree. My reasoning for this is that some very useful clustering and neighbor-matching algorithms are already implemented for you, and in the case of cKDTree, they are implemented as a pre-compiled C library for much better speed than pure Python.
In particular I'd like to use KDTree.query_pairs to generate a set of all point pairs that are closer than a given radius. Once you have that set, you can simply iterate over the pairs and maintain a set of nodes to keep and a set of nodes to remove. This will not, however, preserve exactly the same nodes as your reference implementation: the order of iteration is not guaranteed (points kept will depend on what comes first), and there will be a bias towards a particular direction because the pairs are reported as (i, j) with i < j. That's not to say your initial method is free of bias either, since you sort the input list, so this may not be a problem.
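As a rough, untested sketch of that pair-based filtering (assuming mynumbers and the radius r from the question):

import scipy.spatial

tree = scipy.spatial.cKDTree(mynumbers)       # mynumbers as read from points.txt
removed = set()
for i, j in sorted(tree.query_pairs(r)):      # pairs are reported as (i, j) with i < j
    if i not in removed and j not in removed:
        removed.add(j)                        # bias: drop the higher-indexed point of the pair
keep = [p for k, p in enumerate(mynumbers) if k not in removed]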
Alternatively, you could convert the set of pairs into a dictionary of neighbors where each point is a key, with the value being a set of neighboring points. Then you could take your original list of points in the order you desire and use the dictionary to find and eliminate neighboring points. This should give you the same results as your initial implementation, but with much better speed. I have not tested this, so you may find you need to cast between tuple and numpy.array in a few places for it to work perfectly...
import scipy.spatial

# Convert inner lists to tuples because `set()` doesn't play nice with lists;
# you would probably be better off doing this at creation while reading the file.
mynumbers = [tuple(point) for point in mynumbers]
tree = scipy.spatial.cKDTree(mynumbers)   # build the k-d tree
pairs = tree.query_pairs(r)               # find all pairs closer than radius r
neighbors = {}                            # map point index -> set of neighboring indices
for i, j in pairs:                        # iterate over all pairs
    neighbors.setdefault(i, set()).add(j)
    neighbors.setdefault(j, set()).add(i)
keep = []
discard = set()   # a list would work, but a set gives fast membership testing with `in`
for i, node in enumerate(mynumbers):
    if i not in discard:                         # skip nodes already marked for removal
        keep.append(node)                        # add node to keep list
        discard.update(neighbors.get(i, set()))  # add the node's neighbors to the discard set
3. for j in range(len(mynumbers)):
       fp2.write(' '.join(map(repr, mynumbers[j])) + '\n')

This may be nit-picky of me, but using range() to iterate over something that is already iterable is something that could be written more concisely and beautifully:
for point in mynumbers:
    # simpler and more concise string building
    # this can also be more easily modified using padding, length, and precision
    # modifiers to make the output file more easily readable (constant-width columns)
    fp2.write('{} {} {}\n'.format(*point))
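For example, a fixed-width variant might look like this (reusing fp2 and mynumbers from above; the column widths and precision are arbitrary):

for point in mynumbers:
    # three 12-character-wide columns with 2 decimal places each
    fp2.write('{:12.2f} {:12.2f} {:12.2f}\n'.format(*point))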
answered Jun 8 at 21:30
Aaron

Nice answer! I considered KDTree.query_pairs but I couldn't see how to avoid the quadratic worst case: if there are $\Theta(n)$ points all within $r$ of each other, then KDTree.query_pairs finds $\Theta(n^2)$ pairs.
– Gareth Rees
Jun 9 at 8:27

Great lessons about gc and cKDTree! my respect and thanks!
– Mohamad Reza Salehi Sadaghiani
Jun 9 at 10:20

I implemented your code and I got a ValueError from tree = scipy.spatial.cKDTree(mynumbers). The ValueError is: Buffer has wrong number of dimensions (expected 2, got 1).
– Mohamad Reza Salehi Sadaghiani
Jun 9 at 10:40

I have found my mistake. The variable "mynumbers" was empty. Thanks again.
– Mohamad Reza Salehi Sadaghiani
Jun 9 at 17:05

@MohamadRezaSalehiSadaghiani I would definitely expect my example to become slower as you increase the radius. This goes directly back to Gareth's comment yesterday: as r goes from 0 to ∞, the time complexity goes from $\Theta(n)$ to $\Theta(n^2)$.
– Aaron
Jun 10 at 20:56
A sample of my points has been added. Yes, I can use numpy. join_pair takes the points, calculates the distance (via the dist function) and then compares it with the 3.6-metre limit. If the distance is less than the limit, that point is deleted from the final output. In the end I get a point cloud in which the points are more than 3.6 metres apart.
– Mohamad Reza Salehi Sadaghiani
Jun 8 at 9:53

"there is no need for this size of dat" - is the last word supposed to be "data" or "that"?
– mkrieger1
Jun 9 at 7:58