Count number of registers in interval & location
Recently I asked how one could count the number of registers in an interval, which was answered in https://stackoverflow.com/questions/49240140/count-number-of-registers-in-interval.
The solution works great, but I had to adapt it to also take a localization key into account.
I did that with the following code:
import numpy as np
import pandas as pd

def time_features(df, time_key, T, location_key, output_key):
    """
    Create features based on time such as: how many BDs are open in the same GRA at this moment (hour)?
    """
    assert np.issubdtype(df[time_key], np.datetime64)
    output = pd.DataFrame()
    grouped = df.groupby(location_key)
    for name, group in grouped:
        # initialize times: registers open as +1, close as -1
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1
        aux = group.copy()
        all_times = start_times.copy()
        aux[output_key] = -1
        all_times = all_times.append(aux, ignore_index=True)
        # sort by time and perform a cumulative sum to get opened registers
        # (subtract 1 since you don't want to include the current time as opened)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1
        # revert the index back to original order, and truncate closed times
        all_times = all_times.sort_index().iloc[:len(all_times)//2]
        output = output.append(all_times, ignore_index=True)
    return output
Output:
time loc1 loc2
0 2013-01-01 12:56:00 1 "a"
1 2013-01-01 12:00:12 1 "b"
2 2013-01-01 10:34:28 2 "c"
3 2013-01-01 09:34:54 2 "c"
4 2013-01-01 08:34:55 3 "d"
5 2013-01-01 08:34:55 5 "d"
6 2013-01-01 16:35:19 4 "e"
7 2013-01-01 16:35:30 4 "e"
time_features(df, time_key='time', T=2, location_key='loc1', output_key='count')
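For context, here is a minimal sketch (with hypothetical toy data) of the +1/-1 event trick the function relies on: each register contributes +1 at (time - T) and -1 at its own time, and a cumulative sum over the time-sorted events gives, at every event, how many T-hour intervals are currently open.

```python
import pandas as pd

# Toy sketch (hypothetical data) of the +1/-1 event trick:
# +1 when an interval opens at (time - T), -1 when it closes at time;
# sorting by time and cumulatively summing yields the number of
# currently open intervals at each event.
times = pd.Series(pd.to_datetime([
    "2013-01-01 08:00", "2013-01-01 09:00", "2013-01-01 12:00",
]))
T = pd.Timedelta(hours=2)
events = pd.concat([
    pd.DataFrame({"t": times - T, "delta": 1}),   # interval opens
    pd.DataFrame({"t": times, "delta": -1}),      # interval closes
]).sort_values("t")
events["open"] = events["delta"].cumsum()
```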
This works great for small data, but for larger data (I'm using it with a file with 1 million rows) it takes "forever" to run. I wonder whether I could optimize this computation somehow.
python time-limit-exceeded pandas
Is the localization key about time zone? If so it's probably better to just change the time zones according to that, before getting into a loop. Pytz would be a good place to start for that. – RCA, Mar 29 at 18:21
edited Mar 20 at 17:08 by Dannnno
asked Mar 20 at 16:21 by pceccon
1 Answer (accepted)
Consider not growing a DataFrame inside the for loop; instead, build a list (or dictionary) and concatenate all the DataFrame pieces once, outside the loop.
Growing an object inside a loop repeatedly allocates memory and copies the accumulated data on every iteration. A single call outside the loop runs substantially faster because virtually no intermediate copying is done.
Specifically, change:
output = pd.DataFrame()
to a list:
output = []
then append to the list inside the loop and call pd.concat(list) outside the loop.
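As a standalone illustration of the pattern (hypothetical toy data and column names):

```python
import pandas as pd

# Accumulate per-group frames in a plain Python list, then concatenate once.
# list.append is O(1) and copies nothing; pd.concat performs a single
# allocation at the end instead of one per loop iteration.
df = pd.DataFrame({"loc": ["a", "a", "b"], "val": [1, 2, 3]})
parts = []
for name, group in df.groupby("loc"):
    parts.append(group.assign(key=name))      # cheap list append
result = pd.concat(parts, ignore_index=True)  # one concatenation at the end
```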
def time_features(df, time_key, T, location_key, output_key):
    """
    Create features based on time such as: how many BDs are open in the same GRA at this moment (hour)?
    """
    assert np.issubdtype(df[time_key], np.datetime64)
    output = []
    grouped = df.groupby(location_key)
    for name, group in grouped:
        # initialize times: registers open as +1, close as -1
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1
        aux = group.copy()
        all_times = start_times.copy()
        aux[output_key] = -1
        all_times = all_times.append(aux, ignore_index=True)
        # sort by time and perform a cumulative sum to get opened registers
        # (subtract 1 since you don't want to include the current time as opened)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1
        # revert the index back to original order, and truncate closed times
        all_times = all_times.sort_index().iloc[:len(all_times)//2]
        # APPEND TO LIST
        output.append(all_times)
    # CONCATENATE ALL DF ELEMENTS
    final_df = pd.concat(output, ignore_index=True)
    return final_df
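For completeness, here is a self-contained sketch of the revised function run on a small sample. Note that DataFrame.append was removed in pandas 2.0, so the inner append is also swapped for pd.concat here; the sample data and column names are illustrative, not from the original post.

```python
import pandas as pd

# Hedged sketch of the revised function, with the (since-removed)
# DataFrame.append replaced by pd.concat so it also runs on pandas >= 2.0.
def time_features(df, time_key, T, location_key, output_key):
    output = []
    for name, group in df.groupby(location_key):
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1                      # interval opens
        aux = group.copy()
        aux[output_key] = -1                             # interval closes
        all_times = pd.concat([start_times, aux], ignore_index=True)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1
        all_times = all_times.sort_index().iloc[:len(all_times) // 2]
        output.append(all_times)                         # O(1) list append
    return pd.concat(output, ignore_index=True)          # one concat at the end

# Illustrative sample: two registers in each of two locations
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2013-01-01 12:56:00", "2013-01-01 12:00:12",
        "2013-01-01 10:34:28", "2013-01-01 09:34:54",
    ]),
    "loc1": [1, 1, 2, 2],
})
out = time_features(df, time_key="time", T=2, location_key="loc1", output_key="count")
```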
answered Apr 23 at 18:44 by Parfait