Compute conditional median of PANDAS dataframe
I am new to Python/Pandas.
Consider the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
                   'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
                   'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19]})
print(df)
Output:
       A  Id  Time
0   10.0   1   0.0
1   15.0   1   1.0
2    NaN   1   2.0
3   11.0   2   0.0
4   16.0   2   1.0
5   25.0   2   2.0
6   10.0   3   0.0
7   15.0   3   2.0
8    9.0   4   0.0
9   14.0   4   1.0
10  19.0   4   2.0
I want to add a column Feature_1 which, for each row of the dataframe, holds the median of column A over ALL the rows that have the same Time value. This can be done as follows:
df['Feature_1'] = df.groupby('Time')['A'].transform('median')
print(df)
Output:
       A  Id  Time  Feature_1
0   10.0   1   0.0       10.0
1   15.0   1   1.0       15.0
2    NaN   1   2.0       19.0
3   11.0   2   0.0       10.0
4   16.0   2   1.0       15.0
5   25.0   2   2.0       19.0
6   10.0   3   0.0       10.0
7   15.0   3   2.0       19.0
8    9.0   4   0.0       10.0
9   14.0   4   1.0       15.0
10  19.0   4   2.0       19.0
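Note that Feature_1 is 19.0 for Time 2.0 even though one of the values in that group is NaN. As far as I can tell this is because the pandas median skips missing values by default, whereas NumPy's np.median does not. A minimal sketch on that group alone (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

# Values of column A for Time == 2.0 in the sample data.
s = pd.Series([np.nan, 25.0, 15.0, 19.0])

# pandas drops the NaN before taking the median (skipna=True by default).
pandas_median = s.median()

# NumPy propagates the NaN instead.
numpy_median = np.median(s.to_numpy())

print(pandas_median, numpy_median)
```

If the NaN should propagate instead, Series.median accepts skipna=False.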
My problem now is to compute another feature, Feature_2, which for each row of the dataframe holds the median of column A over the OTHER rows that have the same Time value (that is, rows with a different Id). I was not able to vectorize this, so here is my solution with a for loop:
df['Feature_2'] = np.nan
for i in range(len(df)):
    current_Id = df.Id[i]
    current_time = df.Time[i]
    # Rows sharing the same Time but belonging to another Id.
    idx = (df.Time == current_time) & (df.Id != current_Id)
    if idx.any():
        # .loc avoids chained assignment, which may write to a copy.
        df.loc[i, 'Feature_2'] = df.A[idx].median()
print(df)
Output:
       A  Id  Time  Feature_1  Feature_2
0   10.0   1   0.0       10.0       10.0
1   15.0   1   1.0       15.0       15.0
2    NaN   1   2.0       19.0       19.0
3   11.0   2   0.0       10.0       10.0
4   16.0   2   1.0       15.0       14.5
5   25.0   2   2.0       19.0       17.0
6   10.0   3   0.0       10.0       10.0
7   15.0   3   2.0       19.0       22.0
8    9.0   4   0.0       10.0       10.0
9   14.0   4   1.0       15.0       15.5
10  19.0   4   2.0       19.0       20.0
This works, but it is very slow because my dataframe has 1 million rows (but only four different IDs).
Is it possible to vectorize the creation of Feature_2?
I hope I am clear enough. Live code can be found here.
python performance statistics pandas
edited May 14 at 14:49
200_success
asked May 14 at 13:06
Romain
1 Answer
So, you want to get the medians of the groups by removing each value from the group in turn:
 group             individual removal of values
   NaN           [  ]    NaN    NaN    NaN
  25.0    =>     25.0   [  ]   25.0   25.0
  15.0           15.0   15.0   [  ]   15.0
  19.0           19.0   19.0   19.0   [  ]

median
  19.0           19.0   17.0   22.0   20.0
Another way to do this, besides manually rebuilding the group without the current value for each value, is to build the intermediate matrix above and ask for the median of each column. This returns a Series whose length equals the length of the group, which is supported by SeriesGroupBy.transform.
The steps to get the desired result are:
- build the matrix by repeating the input group as many times as its length;
- fill the diagonal of the matrix with NaNs;
- ask for the median by row or by column, depending on how you built the matrix.
The function that can be fed to transform may look like:
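def median_without_element(group):
    values = group.to_numpy(dtype=float)
    # One row per element, each row a copy of the whole group.
    matrix = np.tile(values, (len(values), 1))
    # Blank out the diagonal so that row i ignores element i.
    np.fill_diagonal(matrix, np.nan)
    return pd.DataFrame(matrix, index=group.index).median(axis=1)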
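To see the idea at work on the Time == 2.0 group from the diagram, here is a minimal sketch. It restates the helper on top of a plain NumPy matrix so that the block is self-contained; that restatement and the index labels are assumptions of this sketch rather than part of the answer.

```python
import numpy as np
import pandas as pd


def median_without_element(group):
    values = group.to_numpy(dtype=float)
    # n x n matrix: every row is a copy of the whole group.
    matrix = np.tile(values, (len(values), 1))
    # Row i loses element i.
    np.fill_diagonal(matrix, np.nan)
    # DataFrame.median skips NaN by default.
    return pd.DataFrame(matrix, index=group.index).median(axis=1)


# Column A for Time == 2.0, with the row labels of the sample dataframe.
group = pd.Series([np.nan, 25.0, 15.0, 19.0], index=[2, 5, 7, 10], name='A')
result = median_without_element(group)
print(result.tolist())  # [19.0, 17.0, 22.0, 20.0]
```

The matrix has one row and one column per element of the group, so memory grows quadratically with the group size. That should be harmless when each Time value has only a handful of rows, as with four IDs.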
Another advantage of this approach is that you can reuse the same groups for both features, so they do not need to be recomputed again and again:
import numpy as np
import pandas as pd


def median_without_element(group):
    values = group.to_numpy(dtype=float)
    # One row per element, each row a copy of the whole group.
    matrix = np.tile(values, (len(values), 1))
    # Blank out the diagonal so that row i ignores element i.
    np.fill_diagonal(matrix, np.nan)
    return pd.DataFrame(matrix, index=group.index).median(axis=1)


def compute_medians(dataframe, groups_column='Time', values_column='A'):
    # Build the groups once and reuse them for both features.
    groups = dataframe.groupby(groups_column)[values_column]
    dataframe['Feature_1'] = groups.transform('median')
    dataframe['Feature_2'] = groups.transform(median_without_element)


if __name__ == '__main__':
    df = pd.DataFrame({
        'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
        'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19],
    })
    compute_medians(df)
    print(df)
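One caveat worth checking: the transform drops the current row, while the loop in the question drops every row with the same Id. The two coincide only when each (Time, Id) pair appears at most once, which holds for the sample data but is an assumption for the real dataframe. A small sketch that compares both on the sample (the helper is restated with NumPy so the block runs on its own; the `expected` and `same` names are only for illustration):

```python
import numpy as np
import pandas as pd


def median_without_element(group):
    values = group.to_numpy(dtype=float)
    matrix = np.tile(values, (len(values), 1))
    np.fill_diagonal(matrix, np.nan)
    return pd.DataFrame(matrix, index=group.index).median(axis=1)


df = pd.DataFrame({
    'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
    'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19],
})

# Vectorized version: one call per Time group.
df['Feature_2'] = df.groupby('Time')['A'].transform(median_without_element)

# Reference version: the row-by-row loop from the question.
expected = []
for i in range(len(df)):
    others = (df['Time'] == df.loc[i, 'Time']) & (df['Id'] != df.loc[i, 'Id'])
    expected.append(df.loc[others, 'A'].median())

same = np.allclose(df['Feature_2'], expected, equal_nan=True)
print(same)
```

If an Id can appear several times for the same Time, the comparison would be expected to fail, and the grouping would need to exclude by Id rather than by row.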
Thanks, it works ;). I have one question though: in the compute_medians function, shouldn't you replace df by dataframe?
– Romain
May 14 at 16:44
@Romain Absolutely, fixed it.
– Mathias Ettinger
May 14 at 17:09
edited May 14 at 17:08
answered May 14 at 14:53
Mathias Ettinger