Filter if any value in group is null
The following code works, but the filter operation is the bottleneck of my analysis. Do you have an idea how I can make the filter operation faster?
Do you have any other comments about the code style?
import sys

import pandas as pd
import numpy as np
from numpy.random import randint, rand, choice, permutation

# Build an ID column: value i is repeated a random number of times (1-9).
ID = [value for sublist in
      ((value for _ in range(length))
       for value, length in enumerate(randint(1, 10, 70000)))
      for value in sublist]

data = pd.DataFrame(rand(len(ID), 3), columns=['A', 'B', 'C'])
data['ID'] = ID
data['dt'] = randint(0, sys.maxsize, len(data)).astype('M8[ns]')
# Mark about 5 % of the timestamps as missing.
data.loc[choice([True, False], len(data), p=[0.05, 0.95]), 'dt'] = None
data = data.apply(permutation)

# Until here the data was only prepared to be similar to my actual data.
# The following command should be optimized.
data.groupby('ID').filter(lambda x: not x['dt'].isnull().any())
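As an aside, the nested comprehension that builds ID can be expressed as a single vectorized call; a minimal, equivalent sketch using np.repeat (not part of the measured code):

lengths = randint(1, 10, 70000)
# np.repeat repeats element i of the first array lengths[i] times,
# which matches the nested generator expression above.
ID = np.repeat(np.arange(len(lengths)), lengths)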
Tags: python, performance, pandas
asked May 11 at 7:07 by mcocdawc (edited May 11 at 7:13)
1 Answer
The filter is slow because the lambda is called once per group, so the roughly 70,000 ID groups are iterated over in Python rather than in compiled code, and that adds up.
One way to avoid this is to count the non-NaN values and the total values per ID in pandas, then mask your data with the comparison. This keeps everything vectorized.
So you could first find the IDs that contain NaNs:
contains_nan = data.groupby("ID")["dt"].count() != data.ID.value_counts().sort_index()
Note that comparing two series element-wise requires their indices to align; value_counts() returns its result ordered by count, so it is sorted by index first. Then keep only the IDs without NaNs and apply the mask:
ids_without_nan = contains_nan[~contains_nan].index
data = data[data.ID.isin(ids_without_nan)]
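An alternative one-step sketch (not benchmarked here) builds a per-row mask with groupby(...).transform('any'), which avoids the explicit index alignment:

# Rows whose ID group contains any missing 'dt' are flagged True.
mask = data['dt'].isnull().groupby(data['ID']).transform('any')
data = data[~mask]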
Benchmarking this method results in:
81.1 ms ± 5.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
versus using filter:
17.7 s ± 396 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
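As a quick sanity check, a sketch (run before data is reassigned) to confirm that both approaches keep exactly the same rows:

# groupby().filter preserves row order, as does boolean masking,
# so the two results should be identical, index included.
expected = data.groupby('ID').filter(lambda x: not x['dt'].isnull().any())
result = data[data.ID.isin(ids_without_nan)]
assert expected.equals(result)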
answered May 11 at 9:17 by mochi (accepted)
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f194170%2ffilter-if-any-value-in-group-is-null%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password