Filter if any value in group is null

The following code works, but the filter operation is the bottleneck of my analysis. Do you have an idea how to make it faster?
Do you have other comments about the code style?



import sys
import pandas as pd
import numpy as np
from numpy.random import randint, rand, choice, permutation

ID = [value
      for value, length in enumerate(randint(1, 10, 70000))
      for _ in range(length)]

data = pd.DataFrame(rand(len(ID), 3), columns=['A', 'B', 'C'])

data['ID'] = ID

data['dt'] = randint(0, sys.maxsize, len(data)).astype('M8[ns]')
data.loc[choice([True, False], len(data), p=[0.05, 0.95]), 'dt'] = None  # p must be passed by keyword; the third positional argument is replace

data = data.apply(permutation)

# Until here the data was only prepared to be similar to my actual data.
# The following command should be optimized.
data.groupby('ID').filter(lambda x: not x['dt'].isnull().any())






edited May 11 at 7:13
asked May 11 at 7:07
mcocdawc
1 Answer



























The filter is slow because the lambda forces pandas to iterate over the groups in Python. With 70,000 ID groups, that takes a while.



One way to avoid this is to count, per ID, both the non-NaN values and the total values in pandas, then mask the data with the comparison. That keeps everything vectorized.



So you could first flag the IDs whose groups contain NaNs:



          contains_nan = data.groupby("ID")["dt"].count() != data.ID.value_counts().sort_index()


Note that comparing two Series requires their indexes to match, which is why the value_counts() result is re-sorted with sort_index(). Then select the IDs without NaNs and apply the mask:



          ids_without_nan = contains_nan[~contains_nan].index
          data = data[data.ID.isin(ids_without_nan)]
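As a sanity check (my addition, not part of the original answer), here are the same counting-and-masking steps on a tiny hand-made frame:

```python
import pandas as pd

# Toy frame: ID 1 has a missing timestamp, ID 2 is complete.
df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'dt': pd.to_datetime(['2020-01-01', None, '2020-01-02', '2020-01-03']),
})

# count() skips NaT while value_counts() counts every row,
# so a mismatch flags the groups that contain a null.
contains_nan = df.groupby('ID')['dt'].count() != df['ID'].value_counts().sort_index()
ids_without_nan = contains_nan[~contains_nan].index

# Only the rows of ID 2 survive.
filtered = df[df['ID'].isin(ids_without_nan)]
```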


          Benchmarking this method results in:



          81.1 ms ± 5.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


          versus using filter:



          17.7 s ± 396 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
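A further variant (my addition, not from the answer): instead of comparing counts, collect the IDs that appear on at least one null row and drop all of their rows with isin, which is also fully vectorized:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'dt': pd.to_datetime(['2020-01-01', None, '2020-01-02', '2020-01-03']),
})

# IDs that appear on at least one row with a missing timestamp ...
ids_with_nan = df.loc[df['dt'].isnull(), 'ID'].unique()

# ... and keep every row whose ID is not among them.
filtered = df[~df['ID'].isin(ids_with_nan)]
```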





answered May 11 at 9:17
mochi






















                     
