Filter if any value in group is null
The following code works, but the filter operation is the bottleneck of my analysis. Do you have an idea how I can make the filter operation faster?
Do you have any other comments about the code style?
import sys

import pandas as pd
import numpy as np
from numpy.random import randint, rand, choice, permutation

# Build an ID column: value i is repeated a random number of times (1-9).
ID = [value for sublist in
      ((value for _ in range(length))
       for value, length in enumerate(randint(1, 10, 70000)))
      for value in sublist]

data = pd.DataFrame(rand(len(ID), 3), columns=['A', 'B', 'C'])
data['ID'] = ID
data['dt'] = randint(0, sys.maxsize, len(data)).astype('M8[ns]')
# Mark about 5 % of the timestamps as missing.
data.loc[choice([True, False], len(data), p=[0.05, 0.95]), 'dt'] = None
data = data.apply(permutation)

# Until here the data was only prepared to be similar to my actual data.
# The following command should be optimized.
data.groupby('ID').filter(lambda x: not x['dt'].isnull().any())
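As an aside, the nested comprehension that builds ID can be expressed as a single vectorized call; a minimal, equivalent sketch using np.repeat (not part of the measured code):

lengths = randint(1, 10, 70000)
# np.repeat repeats element i of the first array lengths[i] times,
# which matches the nested generator expression above.
ID = np.repeat(np.arange(len(lengths)), lengths)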
Tags: python, performance, pandas
asked May 11 at 7:07 by mcocdawc (edited May 11 at 7:13)
1 Answer
The filter is slow because the lambda is called once per group, so the roughly 70,000 ID groups are iterated over in Python rather than in compiled code, and that adds up.
One way to avoid this is to count the non-NaN values and the total values per ID in pandas, then mask your data with the comparison. This keeps everything vectorized.
So you could first find the IDs that contain NaNs:
contains_nan = data.groupby("ID")["dt"].count() != data.ID.value_counts().sort_index()
Note that comparing two series element-wise requires their indices to align; value_counts() returns its result ordered by count, so it is sorted by index first. Then keep only the IDs without NaNs and apply the mask:
ids_without_nan = contains_nan[~contains_nan].index
data = data[data.ID.isin(ids_without_nan)]
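An alternative one-step sketch (not benchmarked here) builds a per-row mask with groupby(...).transform('any'), which avoids the explicit index alignment:

# Rows whose ID group contains any missing 'dt' are flagged True.
mask = data['dt'].isnull().groupby(data['ID']).transform('any')
data = data[~mask]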
Benchmarking this method results in:
81.1 ms ± 5.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
versus using filter:
17.7 s ± 396 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
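As a quick sanity check, a sketch (run before data is reassigned) to confirm that both approaches keep exactly the same rows:

# groupby().filter preserves row order, as does boolean masking,
# so the two results should be identical, index included.
expected = data.groupby('ID').filter(lambda x: not x['dt'].isnull().any())
result = data[data.ID.isin(ids_without_nan)]
assert expected.equals(result)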
answered May 11 at 9:17 by mochi (accepted)
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f194170%2ffilter-if-any-value-in-group-is-null%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password