Compute conditional median of PANDAS dataframe

I am new to Python/Pandas.
Consider the following code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
                   'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
                   'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19]})
print(df)

Output:

       A  Id  Time
0   10.0   1   0.0
1   15.0   1   1.0
2    NaN   1   2.0
3   11.0   2   0.0
4   16.0   2   1.0
5   25.0   2   2.0
6   10.0   3   0.0
7   15.0   3   2.0
8    9.0   4   0.0
9   14.0   4   1.0
10  19.0   4   2.0

I want to add a column Feature_1 which, for each row of the dataframe, contains the median of column A over ALL the rows that have the same Time value. This can be done as follows:

# 'median' is the NaN-skipping groupby median
df['Feature_1'] = df.groupby('Time')['A'].transform('median')
print(df)

Output:

       A  Id  Time  Feature_1
0   10.0   1   0.0       10.0
1   15.0   1   1.0       15.0
2    NaN   1   2.0       19.0
3   11.0   2   0.0       10.0
4   16.0   2   1.0       15.0
5   25.0   2   2.0       19.0
6   10.0   3   0.0       10.0
7   15.0   3   2.0       19.0
8    9.0   4   0.0       10.0
9   14.0   4   1.0       15.0
10  19.0   4   2.0       19.0

My problem now is to compute another feature, Feature_2, which, for each row of the dataframe, contains the median of column A over the OTHER rows (those with a different Id) that have the same Time value. I was not able to vectorize this, so my solution uses a for loop:

df['Feature_2'] = np.nan

for i in range(len(df)):

    current_Id = df.Id[i]
    current_time = df.Time[i]

    # Rows with the same Time but a different Id
    idx = (df.Time == current_time) & (df.Id != current_Id)

    if idx.any():
        # Use .loc to avoid chained assignment
        df.loc[i, 'Feature_2'] = df.A[idx].median()

print(df)

Output:

       A  Id  Time  Feature_1  Feature_2
0   10.0   1   0.0       10.0       10.0
1   15.0   1   1.0       15.0       15.0
2    NaN   1   2.0       19.0       19.0
3   11.0   2   0.0       10.0       10.0
4   16.0   2   1.0       15.0       14.5
5   25.0   2   2.0       19.0       17.0
6   10.0   3   0.0       10.0       10.0
7   15.0   3   2.0       19.0       22.0
8    9.0   4   0.0       10.0       10.0
9   14.0   4   1.0       15.0       15.5
10  19.0   4   2.0       19.0       20.0

This works, but it is very slow because my dataframe has 1 million rows (but only four different IDs).

Is it possible to vectorize the creation of Feature_2?

I hope I am clear enough. Live code can be found here.

edited May 14 at 14:49 by 200_success

asked May 14 at 13:06 by Romain


1 Answer

accepted

So, you want to get the medians of the groups by removing each value from the group in turn:

         group   =>   individual removal of values

          NaN         [  ]   NaN    NaN    NaN
         25.0    =>   25.0   [  ]   25.0   25.0
         15.0         15.0   15.0   [  ]   15.0
         19.0         19.0   19.0   19.0   [  ]

median   19.0         19.0   17.0   22.0   20.0

Another way to do this, besides manually reconstructing the group without the current value for each value, is to build the intermediate matrix above and ask for the median of each column. This returns a Series with the same length as the group, which SeriesGroupBy.transform supports.

The steps to get the desired result are:

1. build the matrix by repeating the input group as many times as its length;
2. fill the diagonal of the matrix with NaNs;
3. ask for the median by row or by column, depending on how you built the matrix.
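As a minimal sketch of these three steps, here they are applied with plain NumPy to the Time == 2.0 group from the question. The variable names are illustrative, and `np.nanmedian` is used on the assumption that NaN entries should simply be ignored, as in the diagram above.

```python
import numpy as np

# Values of column A for the rows where Time == 2.0 (taken from the question).
group = np.array([np.nan, 25.0, 15.0, 19.0])

# Step 1: repeat the group once per element, one copy per row.
matrix = np.tile(group, (len(group), 1))

# Step 2: blank out the diagonal so that row i no longer contains element i.
np.fill_diagonal(matrix, np.nan)

# Step 3: take the NaN-ignoring median of each row.
leave_one_out = np.nanmedian(matrix, axis=1)
print(leave_one_out)  # [19. 17. 22. 20.]
```

The four results match the last four medians in the diagram. Because every row starts as a copy of the same group, taking the median by row here is equivalent to taking it by column on the transposed layout.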

The function that can be fed to transform may look like:

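def median_without_element(group):
    # Repeat the group once per element, one copy per row.
    matrix = np.tile(group.to_numpy(dtype=float), (len(group), 1))
    # Blank out the diagonal so that row i excludes element i.
    np.fill_diagonal(matrix, np.nan)
    # NaN-ignoring median per row, aligned with the group's index.
    return pd.Series(np.nanmedian(matrix, axis=1), index=group.index)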

Another advantage of this approach is that you can reuse the same groups of elements, which avoids recomputing them again and again:

import numpy as np
import pandas as pd


def median_without_element(group):
    # Repeat the group once per element, one copy per row.
    matrix = np.tile(group.to_numpy(dtype=float), (len(group), 1))
    # Blank out the diagonal so that row i excludes element i.
    np.fill_diagonal(matrix, np.nan)
    # NaN-ignoring median per row, aligned with the group's index.
    return pd.Series(np.nanmedian(matrix, axis=1), index=group.index)


def compute_medians(dataframe, groups_column='Time', values_column='A'):
    # Group once and reuse the same groups for both features.
    groups = dataframe.groupby(groups_column)[values_column]
    dataframe['Feature_1'] = groups.transform('median')
    dataframe['Feature_2'] = groups.transform(median_without_element)


if __name__ == '__main__':
    df = pd.DataFrame({
        'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
        'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
        'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19],
    })
    compute_medians(df)
    print(df)
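To check that the matrix approach reproduces the loop from the question, the two can be compared on the sample data. The sketch below is self-contained; `loop_feature` and `leave_one_out_median` are illustrative names, not functions from the original posts. It relies on one assumption: each Id appears at most once per Time value, so "rows with a different Id" and "every row except the current one" select the same values.

```python
import numpy as np
import pandas as pd


def loop_feature(df):
    """Row-by-row version: median of A over other Ids at the same Time."""
    result = pd.Series(np.nan, index=df.index)
    for i in df.index:
        others = (df['Time'] == df.at[i, 'Time']) & (df['Id'] != df.at[i, 'Id'])
        if others.any():
            result.at[i] = df.loc[others, 'A'].median()
    return result


def leave_one_out_median(group):
    """Matrix version: repeat the group, blank the diagonal, take row medians."""
    matrix = np.tile(group.to_numpy(dtype=float), (len(group), 1))
    np.fill_diagonal(matrix, np.nan)
    return pd.Series(np.nanmedian(matrix, axis=1), index=group.index)


df = pd.DataFrame({
    'Time': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 0.0, 2.0, 0.0, 1.0, 2.0],
    'Id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'A': [10, 15, np.nan, 11, 16, 25, 10, 15, 9, 14, 19],
})

expected = loop_feature(df)
vectorized = df.groupby('Time')['A'].transform(leave_one_out_median)

# Compare the two results, treating NaN as equal to NaN.
print(np.allclose(expected, vectorized, equal_nan=True))  # True
```

If an Id can occur several times within the same Time value, the two definitions differ and the matrix would need to mask every row sharing the current Id rather than only the diagonal.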

edited May 14 at 17:08

answered May 14 at 14:53 by Mathias Ettinger

Thanks, It works ;). I have one question though: in the compute_medians function, shouldn't you replace df by dataframe ?
– Romain, May 14 at 16:44

@Romain Absolutely, fixed it.
– Mathias Ettinger, May 14 at 17:09
