Average coverage between the spheres of each sphere group
Let me start with a little bit of context.
I have Spheres, a large pandas DataFrame holding the positions and radii of many spheres over time.
(in real life 10 000 spheres and 10 000 time steps, for now 100 spheres and 100 time steps)
The spheres are grouped using a label. Multiple spheres can share the same label, and the same sphere can have different labels over time.
(one label per sphere at start, one label for all the spheres at the end)
Moreover, those spheres can overlap each other, and I would like to quantify that for each group.
So I wrote a function compute_cov that computes a representative quantity, which I can use with:
Spheres.groupby(by=["Time", "Label"]).apply(compute_cov)
The problem I'm facing is that this is too slow for what I need (again, the real data is about 10,000x larger, and this already takes 1.3 s).
According to cProfile, around 82% of the time is spent inside groupby, and of the 13% spent inside compute_cov, 10% goes to group.values alone.
I already found that if I turn the "Time" index into its own column and sort:
Spheres = Spheres.reset_index(0).sort_values(["Time",'Label'])
groupby is much faster (~5x; it now takes 258 ms). So now the main problem seems to be group.values, which takes 65% of the time.
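For reference, the percentages quoted above come from cProfile; something like the following reproduces that kind of breakdown (the output file name "groupby.prof" is just an arbitrary choice):

import cProfile
import pstats

# profile the groupby/apply call and dump the stats to a file
cProfile.run('Spheres.groupby(by=["Time", "Label"]).apply(compute_cov)', 'groupby.prof')
# show the most expensive calls by cumulative time
pstats.Stats('groupby.prof').sort_stats('cumulative').print_stats(15)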
I first asked this question on Stack Overflow, and someone suggested that I post it here.
Since group has mixed dtypes, someone also suggested accessing its columns independently, but this had almost no impact on performance.
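For clarity, a minimal sketch of that suggestion, using the column names from the frame shown further down (as noted above, it gave no measurable speedup):

# pull each numeric column out as its own ndarray instead of calling group.values
x = group["Posx"].values
y = group["Posy"].values
z = group["Posz"].values
r = group["Radius"].values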
Any idea how I can make this faster with pandas?
(I'm currently trying to switch to dask, but I believe that any gain in pandas will also be a gain in dask.)
def compute_cov(group, with_cython=True):
    """
    Each group contains a number of spheres (x, y, z, r);
    I want to compute the mean coverage.
    """
    n = len(group)
    # if there is only one sphere, there is no coverage
    if n == 1:
        return 0.
    # this statement alone costs 65%!
    data = group.values
    # c_cov is a Cython implementation of what is presented below;
    # the Cython code is invisible to cProfile, so it's fast enough
    if with_cython:
        return c_cov(data)
    """
    From here on, this is just to give you an idea of
    what kind of computation I'm doing.
    Again, the Cython version is invisible to cProfile,
    so it doesn't seem useful to optimize it.
    """
    # indices of every pair of different spheres (rows) in the group
    X1, X2 = np.triu_indices(n, k=1)
    # renaming things for readability
    _, x1, y1, z1, r1 = data[X1].T
    _, x2, y2, z2, r2 = data[X2].T
    # my definition of coverage
    cov = 1 - np.sqrt((x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2) / (r1+r2)
    # ignoring negative values (no contact)
    cov = cov[cov > 0]
    # averaging
    if cov.size > 0:
        res = cov.mean()
    else:
        res = 0
    return res
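As a quick sanity check of the coverage definition above, here is a tiny numeric example with made-up coordinates (not taken from the data set): two spheres whose centres are 1.0 apart, with radii 0.7 and 0.6, overlap since 1.0 < 1.3.

import numpy as np

d = np.sqrt((1.0 - 0.0)**2 + (0.0 - 0.0)**2 + (0.0 - 0.0)**2)  # distance between centres
cov = 1 - d / (0.7 + 0.6)
print(cov)  # ~0.231; 0 means the spheres just touch, negative values mean no contact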
And Spheres (file is here) looks like this:
Label Posx Posy Posz Radius
Time Num
0.000000 0 0 3.386984e-07 1.589845e-07 3.156847e-07 6.025496e-09
1 1 3.675054e-07 7.963736e-08 1.351358e-07 5.888543e-09
2 2 1.119772e-07 2.233176e-07 1.924494e-07 5.380718e-09
3 3 1.470528e-07 2.069633e-07 3.838650e-07 6.802969e-09
4 4 2.562696e-07 2.891584e-07 5.708315e-08 5.312195e-09
5 5 6.571124e-09 9.791307e-08 5.532111e-08 6.053221e-09
6 6 6.316083e-08 1.616296e-07 5.232142e-08 3.797439e-09
7 7 4.026887e-07 8.798422e-08 2.067745e-07 6.237204e-09
8 8 2.469688e-07 1.193369e-07 2.570115e-07 5.068430e-09
9 9 1.989743e-07 3.921473e-07 1.179200e-07 5.902088e-09
10 10 2.123426e-07 3.103694e-07 1.613411e-07 6.586051e-09
11 11 1.142105e-07 1.420838e-07 3.256118e-07 6.831307e-09
12 12 2.811991e-08 3.826949e-07 2.120404e-07 3.686755e-09
13 13 7.748568e-08 2.673616e-07 3.588726e-07 4.584994e-09
14 14 2.586889e-08 8.071737e-09 1.845098e-07 3.554399e-09
15 15 9.605596e-08 3.912842e-07 3.637002e-07 6.306579e-09
16 16 1.074989e-07 2.175894e-07 1.512543e-07 5.854575e-09
17 17 2.066144e-07 2.691743e-07 2.143024e-07 3.376725e-09
18 18 1.764215e-07 3.756435e-07 3.752302e-07 5.698067e-09
19 19 1.146050e-07 2.977196e-07 2.579897e-07 4.599236e-09
20 20 2.772923e-07 6.690789e-08 1.774159e-07 6.499418e-09
21 21 3.342694e-07 1.331663e-07 9.230217e-08 6.600707e-09
22 22 1.412380e-07 2.768119e-07 3.855737e-07 5.256329e-09
23 23 2.649739e-07 3.461516e-07 1.771964e-07 6.882931e-09
24 24 1.606187e-07 3.284507e-07 2.758237e-07 6.752818e-09
25 25 1.945027e-07 8.700385e-08 3.830679e-07 6.842569e-09
26 26 5.952504e-08 3.551758e-07 2.584339e-07 4.812374e-09
27 27 2.497732e-07 1.133013e-07 3.168550e-07 4.469074e-09
28 28 1.802092e-07 9.114862e-08 7.559878e-08 4.379245e-09
29 29 2.243149e-07 1.679009e-07 6.837240e-08 6.714596e-09
... ... ... ... ... ...
0.000003 70 0 1.278495e-07 2.375712e-07 1.663126e-08 4.536631e-09
71 1 3.660745e-07 1.562219e-07 1.063525e-07 6.830331e-09
72 0 6.141226e-08 2.245705e-07 -3.504173e-08 5.570172e-09
73 0 6.176349e-08 1.768351e-07 -1.878997e-08 6.803737e-09
74 0 3.724008e-08 1.716644e-07 -2.092554e-08 5.136516e-09
75 0 1.314168e-07 2.360284e-07 2.691397e-08 6.456112e-09
76 0 5.845132e-08 2.155723e-07 -3.202164e-08 4.372447e-09
77 0 6.260762e-08 1.898116e-07 -2.036060e-08 6.294658e-09
78 0 5.870803e-08 1.600778e-07 -2.961800e-08 5.564551e-09
79 0 9.130520e-08 2.381047e-07 -3.473163e-08 4.978849e-09
80 1 3.959347e-07 1.558427e-07 1.019283e-07 4.214814e-09
81 0 8.323550e-08 2.358459e-07 -3.005664e-08 4.616857e-09
82 0 1.232102e-07 2.407576e-07 3.397732e-08 5.359298e-09
83 0 5.662502e-08 2.118005e-07 -2.063705e-08 4.546367e-09
84 0 1.135318e-07 2.240874e-07 -2.560423e-08 4.328089e-09
85 0 7.204258e-08 2.010134e-07 -3.487838e-08 5.439786e-09
86 0 1.278136e-07 2.104107e-07 2.828027e-10 3.712955e-09
87 0 1.202827e-07 2.116802e-07 -1.142444e-08 4.347568e-09
88 1 3.469586e-07 1.382176e-07 9.114768e-08 3.994887e-09
89 1 3.763531e-07 1.490025e-07 9.602604e-08 4.169581e-09
90 1 3.528888e-07 1.445890e-07 9.125105e-08 4.709859e-09
91 0 1.327863e-07 1.984836e-07 -1.740811e-08 5.412026e-09
92 0 7.726591e-08 1.933702e-07 -3.621201e-08 3.913367e-09
93 0 1.122231e-07 2.435780e-07 -2.710722e-08 5.915332e-09
94 0 1.085695e-07 2.327729e-07 -2.492152e-08 5.698270e-09
95 0 1.369983e-07 2.549795e-07 -6.333421e-08 5.649468e-09
96 0 1.430033e-07 1.995499e-07 -9.115494e-09 3.726830e-09
97 0 9.940096e-08 2.317013e-07 2.647245e-09 5.472444e-09
98 1 3.593535e-07 1.451526e-07 9.626210e-08 3.488982e-09
99 0 1.526954e-07 2.533845e-07 -4.934458e-08 4.841371e-09
[9900 rows x 5 columns]
EDIT: added a link to a sample file on a GitHub repo.
python performance pandas
asked May 1 at 8:05, edited May 3 at 8:43
pums974