Average coverage between each sphere of each sphere group

Let me start with a little bit of context.



I have Spheres, a large pandas DataFrame with the positions and radii of multiple spheres over time.

(in real life 10 000 spheres and 10 000 time steps, for now 100 spheres and 100 time steps)



The spheres are grouped using a label. Multiple spheres can share the same label, and the same sphere can have multiple labels over time.

(one label per sphere at start, one label for all the spheres at the end)



Moreover, those spheres can overlap each other, and I would like to quantify that for each group.



So I wrote a function compute_cov that computes a representative quantity, which I can use with:



Spheres.groupby(by=["Time", "Label"]).apply(compute_cov)


The problem I'm facing is that this is too slow for what I need (again, the real data is about 10 000x larger, and this already takes 1.3 s).



According to cProfile, around 82% of the time is spent inside groupby, and of the 13% spent inside compute_cov, 10% is spent on group.values alone.



I already found that if I turn the "Time" index into its own column and sort:



Spheres = Spheres.reset_index(0).sort_values(["Time",'Label'])


groupby is much faster (~5x; it now takes 258 ms). So the main problem now seems to be group.values, which takes 65% of the time.



I first asked this question on Stack Overflow, and someone suggested I post it here.



Since group has mixed datatypes, someone also suggested accessing its columns independently, but this has almost no impact on performance.
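For reference, a minimal sketch of what that per-column access would look like (not the original code; column names are taken from the frame shown below, and the rest mirrors the pure-NumPy path of compute_cov):

```python
import numpy as np
import pandas as pd

def compute_cov_columns(group):
    """Sketch of the suggested variant: pull each numeric column out as a
    NumPy array instead of calling group.values on the mixed-dtype frame."""
    x = group["Posx"].to_numpy()
    y = group["Posy"].to_numpy()
    z = group["Posz"].to_numpy()
    r = group["Radius"].to_numpy()
    n = len(x)
    if n == 1:
        return 0.0
    # indices of every pair of distinct spheres in the group
    i, j = np.triu_indices(n, k=1)
    cov = 1 - np.sqrt((x[i] - x[j])**2 + (y[i] - y[j])**2
                      + (z[i] - z[j])**2) / (r[i] + r[j])
    cov = cov[cov > 0]
    return cov.mean() if cov.size else 0.0
```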



Any ideas on how I can make it faster with pandas?



(I'm currently trying to switch to dask, but I believe that any gain in pandas will be a gain in dask)



def compute_cov(group, with_cython=True):
    """
    Each group contains a number of spheres (x, y, z, r);
    I want to compute the mean coverage.
    """

    n = len(group)

    # if only one sphere, no coverage
    if n == 1:
        return 0.

    # this statement alone costs 65%!
    data = group.values

    # c_cov is a Cython implementation of what is presented below;
    # the Cython code is invisible to cProfile, so it's fast enough
    if with_cython:
        return c_cov(data)

    """
    From here on, this is just to give you an idea of
    what kind of computation I'm doing.

    Again, the Cython version is invisible to cProfile,
    so it doesn't seem useful to optimize.
    """

    # indices of every pair of distinct spheres in the group
    X1, X2 = np.triu_indices(n, k=1)

    # renaming things for readability
    _, x1, y1, z1, r1 = data[X1].T
    _, x2, y2, z2, r2 = data[X2].T

    # my definition of coverage
    cov = 1 - np.sqrt((x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2) / (r1+r2)

    # ignoring negative values (no contact)
    cov = cov[cov > 0]

    # averaging
    if cov.size > 0:
        res = cov.mean()
    else:
        res = 0

    return res
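One way to sidestep the per-group group.values cost entirely (a sketch under assumptions, not the author's method: it converts the numeric columns to NumPy once, then slices the sorted arrays at group boundaries instead of going through apply; column names match the frame shown below):

```python
import numpy as np
import pandas as pd

def mean_coverage_fast(spheres):
    """Sketch: sort once, convert to NumPy once, then slice each
    (Time, Label) group out of the arrays by its boundary indices."""
    df = spheres.reset_index().sort_values(["Time", "Label"])
    xyz = df[["Posx", "Posy", "Posz"]].to_numpy()
    rad = df["Radius"].to_numpy()
    # integer code per (Time, Label) group, in sorted order
    codes = pd.factorize(list(zip(df["Time"], df["Label"])))[0]
    # positions where a new group starts
    starts = np.flatnonzero(np.r_[True, codes[1:] != codes[:-1]])
    ends = np.r_[starts[1:], len(codes)]
    out = np.zeros(len(starts))
    for g, (a, b) in enumerate(zip(starts, ends)):
        n = b - a
        if n == 1:
            continue  # single sphere: coverage stays 0
        i, j = np.triu_indices(n, k=1)
        p, r = xyz[a:b], rad[a:b]
        d = np.sqrt(((p[i] - p[j]) ** 2).sum(axis=1))
        cov = 1 - d / (r[i] + r[j])
        cov = cov[cov > 0]
        if cov.size:
            out[g] = cov.mean()
    return out
```

The remaining Python-level loop is over groups only, so the arrays handed to the pairwise computation (or to c_cov) are plain float slices with no per-group DataFrame extraction.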


And Spheres (file is here) looks like this:



 Label Posx Posy Posz Radius
Time Num
0.000000 0 0 3.386984e-07 1.589845e-07 3.156847e-07 6.025496e-09
1 1 3.675054e-07 7.963736e-08 1.351358e-07 5.888543e-09
2 2 1.119772e-07 2.233176e-07 1.924494e-07 5.380718e-09
3 3 1.470528e-07 2.069633e-07 3.838650e-07 6.802969e-09
4 4 2.562696e-07 2.891584e-07 5.708315e-08 5.312195e-09
5 5 6.571124e-09 9.791307e-08 5.532111e-08 6.053221e-09
6 6 6.316083e-08 1.616296e-07 5.232142e-08 3.797439e-09
7 7 4.026887e-07 8.798422e-08 2.067745e-07 6.237204e-09
8 8 2.469688e-07 1.193369e-07 2.570115e-07 5.068430e-09
9 9 1.989743e-07 3.921473e-07 1.179200e-07 5.902088e-09
10 10 2.123426e-07 3.103694e-07 1.613411e-07 6.586051e-09
11 11 1.142105e-07 1.420838e-07 3.256118e-07 6.831307e-09
12 12 2.811991e-08 3.826949e-07 2.120404e-07 3.686755e-09
13 13 7.748568e-08 2.673616e-07 3.588726e-07 4.584994e-09
14 14 2.586889e-08 8.071737e-09 1.845098e-07 3.554399e-09
15 15 9.605596e-08 3.912842e-07 3.637002e-07 6.306579e-09
16 16 1.074989e-07 2.175894e-07 1.512543e-07 5.854575e-09
17 17 2.066144e-07 2.691743e-07 2.143024e-07 3.376725e-09
18 18 1.764215e-07 3.756435e-07 3.752302e-07 5.698067e-09
19 19 1.146050e-07 2.977196e-07 2.579897e-07 4.599236e-09
20 20 2.772923e-07 6.690789e-08 1.774159e-07 6.499418e-09
21 21 3.342694e-07 1.331663e-07 9.230217e-08 6.600707e-09
22 22 1.412380e-07 2.768119e-07 3.855737e-07 5.256329e-09
23 23 2.649739e-07 3.461516e-07 1.771964e-07 6.882931e-09
24 24 1.606187e-07 3.284507e-07 2.758237e-07 6.752818e-09
25 25 1.945027e-07 8.700385e-08 3.830679e-07 6.842569e-09
26 26 5.952504e-08 3.551758e-07 2.584339e-07 4.812374e-09
27 27 2.497732e-07 1.133013e-07 3.168550e-07 4.469074e-09
28 28 1.802092e-07 9.114862e-08 7.559878e-08 4.379245e-09
29 29 2.243149e-07 1.679009e-07 6.837240e-08 6.714596e-09
... ... ... ... ... ...
0.000003 70 0 1.278495e-07 2.375712e-07 1.663126e-08 4.536631e-09
71 1 3.660745e-07 1.562219e-07 1.063525e-07 6.830331e-09
72 0 6.141226e-08 2.245705e-07 -3.504173e-08 5.570172e-09
73 0 6.176349e-08 1.768351e-07 -1.878997e-08 6.803737e-09
74 0 3.724008e-08 1.716644e-07 -2.092554e-08 5.136516e-09
75 0 1.314168e-07 2.360284e-07 2.691397e-08 6.456112e-09
76 0 5.845132e-08 2.155723e-07 -3.202164e-08 4.372447e-09
77 0 6.260762e-08 1.898116e-07 -2.036060e-08 6.294658e-09
78 0 5.870803e-08 1.600778e-07 -2.961800e-08 5.564551e-09
79 0 9.130520e-08 2.381047e-07 -3.473163e-08 4.978849e-09
80 1 3.959347e-07 1.558427e-07 1.019283e-07 4.214814e-09
81 0 8.323550e-08 2.358459e-07 -3.005664e-08 4.616857e-09
82 0 1.232102e-07 2.407576e-07 3.397732e-08 5.359298e-09
83 0 5.662502e-08 2.118005e-07 -2.063705e-08 4.546367e-09
84 0 1.135318e-07 2.240874e-07 -2.560423e-08 4.328089e-09
85 0 7.204258e-08 2.010134e-07 -3.487838e-08 5.439786e-09
86 0 1.278136e-07 2.104107e-07 2.828027e-10 3.712955e-09
87 0 1.202827e-07 2.116802e-07 -1.142444e-08 4.347568e-09
88 1 3.469586e-07 1.382176e-07 9.114768e-08 3.994887e-09
89 1 3.763531e-07 1.490025e-07 9.602604e-08 4.169581e-09
90 1 3.528888e-07 1.445890e-07 9.125105e-08 4.709859e-09
91 0 1.327863e-07 1.984836e-07 -1.740811e-08 5.412026e-09
92 0 7.726591e-08 1.933702e-07 -3.621201e-08 3.913367e-09
93 0 1.122231e-07 2.435780e-07 -2.710722e-08 5.915332e-09
94 0 1.085695e-07 2.327729e-07 -2.492152e-08 5.698270e-09
95 0 1.369983e-07 2.549795e-07 -6.333421e-08 5.649468e-09
96 0 1.430033e-07 1.995499e-07 -9.115494e-09 3.726830e-09
97 0 9.940096e-08 2.317013e-07 2.647245e-09 5.472444e-09
98 1 3.593535e-07 1.451526e-07 9.626210e-08 3.488982e-09
99 0 1.526954e-07 2.533845e-07 -4.934458e-08 4.841371e-09

[9900 rows x 5 columns]


EDIT: added a link to a sample file on a GitHub repo.







      edited May 3 at 8:43
























      asked May 1 at 8:05









      pums974

      112



