Average coverage between the spheres of each sphere group

Let me start with a little bit of context.



I have Spheres, a large pandas DataFrame with the positions and radii of many spheres over time.

(in real life 10 000 spheres and 10 000 time steps; for now 100 spheres and 100 time steps)



The spheres are grouped using a label. Multiple spheres can share the same label, and the same sphere can have different labels over time.

(one label per sphere at start, one label for all the spheres at the end)



Moreover, those spheres can overlap each other, and I would like to quantify that for each group.



So I wrote a function compute_cov that computes a representative quantity, which I can use with:



Spheres.groupby(by=["Time", "Label"]).apply(compute_cov)


The problem I'm facing is that this is too slow for what I need (again, the real data is about 10 000× larger, and this already takes 1.3 s).



According to cProfile, around 82% of the time is spent inside groupby, and of the 13% spent inside compute_cov, 10 percentage points go to group.values alone.
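
For reference, those numbers come from plain cProfile on the one-liner above, sorted by cumulative time, roughly:

import cProfile

cProfile.run('Spheres.groupby(by=["Time", "Label"]).apply(compute_cov)',
             sort="cumtime")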



I already found that if I turn the "Time" index into its own column and sort:



Spheres = Spheres.reset_index(0).sort_values(["Time", "Label"])


groupby is much faster (~5×; it now takes 258 ms). So the main problem now seems to be group.values, which takes 65% of the time.



I first asked this question on Stack Overflow, and someone there suggested that I post it here.



Since group has mixed dtypes, someone also suggested accessing its columns independently, but this had almost no impact on performance.
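
For reference, "accessing the columns independently" means roughly the following; c_cov_cols is a hypothetical per-column variant of my Cython kernel, not the c_cov shown below:

def compute_cov_cols(group):
    # each Series .values is already a contiguous 1-D float64 array,
    # so this avoids the mixed-dtype consolidation of group.values
    if len(group) == 1:
        return 0.
    x = group["Posx"].values
    y = group["Posy"].values
    z = group["Posz"].values
    r = group["Radius"].values
    return c_cov_cols(x, y, z, r)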



Any idea how I can make this faster with pandas?



(I'm currently trying to switch to dask, but I believe that any gain in pandas will also be a gain in dask.)
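
The dask version I'm experimenting with looks roughly like this; npartitions is an arbitrary guess, and meta only declares the output dtype:

import dask.dataframe as dd

# same groupby/apply, but on a partitioned dataframe
# (reset_index so that "Time" and "Label" are plain columns)
ddf = dd.from_pandas(Spheres.reset_index(), npartitions=8)
result = (ddf.groupby(["Time", "Label"])
             .apply(compute_cov, meta=("cov", "f8"))
             .compute())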



import numpy as np


def compute_cov(group, with_cython=True):
    """
    Each group contains a number of spheres (x, y, z, r);
    I want to compute the mean coverage.
    """
    n = len(group)

    # if there is only one sphere, there is no coverage
    if n == 1:
        return 0.

    # this statement alone costs 65%!
    data = group.values

    # c_cov is a Cython implementation of what is presented below;
    # the Cython code is invisible to cProfile, so it's fast enough
    if with_cython:
        return c_cov(data)

    # From here on, this is just to give you an idea of the kind of
    # computation I'm doing. Again, the Cython version is invisible to
    # cProfile, so it doesn't seem useful to optimize it.

    # indices (i, j) of every pair of two different spheres in the group
    X1, X2 = np.triu_indices(n, k=1)

    # renaming things for readability
    _, x1, y1, z1, r1 = data[X1].T
    _, x2, y2, z2, r2 = data[X2].T

    # my definition of coverage: positive exactly when the spheres
    # overlap (centre distance smaller than the sum of the radii)
    cov = 1 - np.sqrt((x1 - x2)**2 + (y1 - y2)**2 + (z1 - z2)**2) / (r1 + r2)

    # ignore negative values (no contact)
    cov = cov[cov > 0]

    # average
    if cov.size > 0:
        res = cov.mean()
    else:
        res = 0.

    return res
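
One direction I'm considering, to pay for a single conversion instead of one group.values per group, is to slice one flat NumPy array at the group boundaries myself. A rough, untested sketch; it assumes the frame is already sorted by Time and Label as above, and it reuses the same c_cov kernel:

import numpy as np

def compute_all_covs(spheres):
    # one conversion for the whole frame (Label kept first, to match
    # the column layout that c_cov expects)
    data = spheres[["Label", "Posx", "Posy", "Posz", "Radius"]].values
    t = spheres["Time"].values
    lab = spheres["Label"].values

    # a new group starts wherever Time or Label changes
    new_group = np.r_[True, (t[1:] != t[:-1]) | (lab[1:] != lab[:-1])]
    starts = np.flatnonzero(new_group)
    ends = np.r_[starts[1:], len(t)]

    # single-sphere groups have no coverage, as in compute_cov
    return np.array([c_cov(data[a:b]) if b - a > 1 else 0.
                     for a, b in zip(starts, ends)])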


And Spheres (sample file on the GitHub repo mentioned in the edit below) looks like this:



 Label Posx Posy Posz Radius
Time Num
0.000000 0 0 3.386984e-07 1.589845e-07 3.156847e-07 6.025496e-09
1 1 3.675054e-07 7.963736e-08 1.351358e-07 5.888543e-09
2 2 1.119772e-07 2.233176e-07 1.924494e-07 5.380718e-09
3 3 1.470528e-07 2.069633e-07 3.838650e-07 6.802969e-09
4 4 2.562696e-07 2.891584e-07 5.708315e-08 5.312195e-09
5 5 6.571124e-09 9.791307e-08 5.532111e-08 6.053221e-09
6 6 6.316083e-08 1.616296e-07 5.232142e-08 3.797439e-09
7 7 4.026887e-07 8.798422e-08 2.067745e-07 6.237204e-09
8 8 2.469688e-07 1.193369e-07 2.570115e-07 5.068430e-09
9 9 1.989743e-07 3.921473e-07 1.179200e-07 5.902088e-09
10 10 2.123426e-07 3.103694e-07 1.613411e-07 6.586051e-09
11 11 1.142105e-07 1.420838e-07 3.256118e-07 6.831307e-09
12 12 2.811991e-08 3.826949e-07 2.120404e-07 3.686755e-09
13 13 7.748568e-08 2.673616e-07 3.588726e-07 4.584994e-09
14 14 2.586889e-08 8.071737e-09 1.845098e-07 3.554399e-09
15 15 9.605596e-08 3.912842e-07 3.637002e-07 6.306579e-09
16 16 1.074989e-07 2.175894e-07 1.512543e-07 5.854575e-09
17 17 2.066144e-07 2.691743e-07 2.143024e-07 3.376725e-09
18 18 1.764215e-07 3.756435e-07 3.752302e-07 5.698067e-09
19 19 1.146050e-07 2.977196e-07 2.579897e-07 4.599236e-09
20 20 2.772923e-07 6.690789e-08 1.774159e-07 6.499418e-09
21 21 3.342694e-07 1.331663e-07 9.230217e-08 6.600707e-09
22 22 1.412380e-07 2.768119e-07 3.855737e-07 5.256329e-09
23 23 2.649739e-07 3.461516e-07 1.771964e-07 6.882931e-09
24 24 1.606187e-07 3.284507e-07 2.758237e-07 6.752818e-09
25 25 1.945027e-07 8.700385e-08 3.830679e-07 6.842569e-09
26 26 5.952504e-08 3.551758e-07 2.584339e-07 4.812374e-09
27 27 2.497732e-07 1.133013e-07 3.168550e-07 4.469074e-09
28 28 1.802092e-07 9.114862e-08 7.559878e-08 4.379245e-09
29 29 2.243149e-07 1.679009e-07 6.837240e-08 6.714596e-09
... ... ... ... ... ...
0.000003 70 0 1.278495e-07 2.375712e-07 1.663126e-08 4.536631e-09
71 1 3.660745e-07 1.562219e-07 1.063525e-07 6.830331e-09
72 0 6.141226e-08 2.245705e-07 -3.504173e-08 5.570172e-09
73 0 6.176349e-08 1.768351e-07 -1.878997e-08 6.803737e-09
74 0 3.724008e-08 1.716644e-07 -2.092554e-08 5.136516e-09
75 0 1.314168e-07 2.360284e-07 2.691397e-08 6.456112e-09
76 0 5.845132e-08 2.155723e-07 -3.202164e-08 4.372447e-09
77 0 6.260762e-08 1.898116e-07 -2.036060e-08 6.294658e-09
78 0 5.870803e-08 1.600778e-07 -2.961800e-08 5.564551e-09
79 0 9.130520e-08 2.381047e-07 -3.473163e-08 4.978849e-09
80 1 3.959347e-07 1.558427e-07 1.019283e-07 4.214814e-09
81 0 8.323550e-08 2.358459e-07 -3.005664e-08 4.616857e-09
82 0 1.232102e-07 2.407576e-07 3.397732e-08 5.359298e-09
83 0 5.662502e-08 2.118005e-07 -2.063705e-08 4.546367e-09
84 0 1.135318e-07 2.240874e-07 -2.560423e-08 4.328089e-09
85 0 7.204258e-08 2.010134e-07 -3.487838e-08 5.439786e-09
86 0 1.278136e-07 2.104107e-07 2.828027e-10 3.712955e-09
87 0 1.202827e-07 2.116802e-07 -1.142444e-08 4.347568e-09
88 1 3.469586e-07 1.382176e-07 9.114768e-08 3.994887e-09
89 1 3.763531e-07 1.490025e-07 9.602604e-08 4.169581e-09
90 1 3.528888e-07 1.445890e-07 9.125105e-08 4.709859e-09
91 0 1.327863e-07 1.984836e-07 -1.740811e-08 5.412026e-09
92 0 7.726591e-08 1.933702e-07 -3.621201e-08 3.913367e-09
93 0 1.122231e-07 2.435780e-07 -2.710722e-08 5.915332e-09
94 0 1.085695e-07 2.327729e-07 -2.492152e-08 5.698270e-09
95 0 1.369983e-07 2.549795e-07 -6.333421e-08 5.649468e-09
96 0 1.430033e-07 1.995499e-07 -9.115494e-09 3.726830e-09
97 0 9.940096e-08 2.317013e-07 2.647245e-09 5.472444e-09
98 1 3.593535e-07 1.451526e-07 9.626210e-08 3.488982e-09
99 0 1.526954e-07 2.533845e-07 -4.934458e-08 4.841371e-09

[9900 rows x 5 columns]


EDIT: added a link to a sample file on a GitHub repo.
asked May 1 at 8:05 by pums974, edited May 3 at 8:43