Count number of registers in interval & location

Recently I asked how to count the number of registers within a time interval, which was answered in https://stackoverflow.com/questions/49240140/count-number-of-registers-in-interval.



That solution works well, but I had to adapt it to also take a location key into account.



I did that with the following code:



import numpy as np
import pandas as pd

def time_features(df, time_key, T, location_key, output_key):
    """
    Create features based on time such as: how many BDs are open
    in the same GRA at this moment (hour)?
    """
    assert np.issubdtype(df[time_key].dtype, np.datetime64)
    output = pd.DataFrame()

    grouped = df.groupby(location_key)
    for name, group in grouped:
        # mark the time a register opens with +1 and the time it closes with -1
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1

        aux = group.copy()
        all_times = start_times.copy()
        aux[output_key] = -1
        all_times = all_times.append(aux, ignore_index=True)

        # sort by time and perform a cumulative sum to get opened registers
        # (subtract 1 since you don't want to include the current time as opened)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1

        # revert the index back to original order, and truncate closed times
        all_times = all_times.sort_index().iloc[:len(all_times) // 2]
        output = output.append(all_times, ignore_index=True)
    return output
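
To see the counting trick in isolation: each register contributes +1 at (its time - T hours) and -1 at its own time; sorting all events by time and taking a cumulative sum then tells each register how many others fell in the preceding T hours. A toy sketch with made-up timestamps (not from the question):

import pandas as pd

# toy illustration of the +1/-1 cumulative-sum trick with T = 2 hours;
# the three timestamps are invented for illustration
times = pd.to_datetime(['2013-01-01 09:00',
                        '2013-01-01 10:30',
                        '2013-01-01 11:15'])
T = 2

starts = pd.DataFrame({'time': times - pd.Timedelta(hours=T), 'delta': 1})
ends = pd.DataFrame({'time': times, 'delta': -1})
events = pd.concat([starts, ends]).sort_values('time')

# the running sum counts how many windows are open at each event;
# subtract 1 so a register does not count itself
events['open'] = events['delta'].cumsum() - 1
print(events)
# the rows that came from `starts` carry the counts: 0 for 09:00,
# 1 for 10:30 (09:00 is within 2h), 1 for 11:15 (10:30 is within 2h)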


Example input:



                  time  loc1 loc2
0  2013-01-01 12:56:00     1  "a"
1  2013-01-01 12:00:12     1  "b"
2  2013-01-01 10:34:28     2  "c"
3  2013-01-01 09:34:54     2  "c"
4  2013-01-01 08:34:55     3  "d"
5  2013-01-01 08:34:55     5  "d"
6  2013-01-01 16:35:19     4  "e"
7  2013-01-01 16:35:30     4  "e"

time_features(df, time_key='time', T=2, location_key='loc1', output_key='count')
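
For anyone who wants to reproduce this, a minimal sketch that rebuilds the sample dataframe above (the column names match the example; the construction itself is mine, not from the original post):

import pandas as pd

# rebuild the example input shown above
df = pd.DataFrame({
    'time': pd.to_datetime(['2013-01-01 12:56:00', '2013-01-01 12:00:12',
                            '2013-01-01 10:34:28', '2013-01-01 09:34:54',
                            '2013-01-01 08:34:55', '2013-01-01 08:34:55',
                            '2013-01-01 16:35:19', '2013-01-01 16:35:30']),
    'loc1': [1, 1, 2, 2, 3, 5, 4, 4],
    'loc2': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'e'],
})

counts = time_features(df, time_key='time', T=2, location_key='loc1', output_key='count')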


This works well for small data, but for larger data (I'm using it on a file with 1 million rows) it takes "forever" to run. I wonder if this computation could be optimized somehow.







asked Mar 20 at 16:21 by pceccon, edited Mar 20 at 17:08 by Dannnno
  • Is the localization key about time zone? If so it's probably better to just change the time zones according to that, before getting into a loop. Pytz would be a good place to start for that.
    – RCA
    Mar 29 at 18:21
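
If the key really is a time zone, a minimal sketch of the commenter's suggestion, assuming a hypothetical tz column holding IANA zone names (modern pandas handles this directly, with pytz doing the work underneath):

import pandas as pd

# hypothetical: a 'tz' column with IANA zone names, e.g. 'America/Sao_Paulo';
# localize each timestamp to its own zone and convert everything to UTC once,
# before any grouping, so all comparisons happen on a common clock
def normalize_to_utc(df, time_key='time', tz_key='tz'):
    out = df.copy()
    out[time_key] = pd.to_datetime([
        t.tz_localize(z).tz_convert('UTC')
        for t, z in zip(df[time_key], df[tz_key])
    ])
    return out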
















1 Answer (accepted)
Consider not growing a dataframe inside the for loop; instead, build a list (or dictionary) of dataframes and concatenate them all at once outside the loop.



Growing an object inside a loop forces repeated memory allocation, and the data accumulated so far is copied on every iteration. A single concatenation call outside the loop should run substantially faster, since virtually no intermediate copying is done.



Specifically change:



output = pd.DataFrame()


To a list:



output = []


Then append to the list inside the loop and call pd.concat(list) once outside the loop.



import numpy as np
import pandas as pd

def time_features(df, time_key, T, location_key, output_key):
    """
    Create features based on time such as: how many BDs are open
    in the same GRA at this moment (hour)?
    """
    assert np.issubdtype(df[time_key].dtype, np.datetime64)
    output = []

    grouped = df.groupby(location_key)
    for name, group in grouped:
        # mark the time a register opens with +1 and the time it closes with -1
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1

        aux = group.copy()
        all_times = start_times.copy()
        aux[output_key] = -1
        all_times = all_times.append(aux, ignore_index=True)

        # sort by time and perform a cumulative sum to get opened registers
        # (subtract 1 since you don't want to include the current time as opened)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1

        # revert the index back to original order, and truncate closed times
        all_times = all_times.sort_index().iloc[:len(all_times) // 2]
        # APPEND TO LIST
        output.append(all_times)

    # CONCATENATE ALL DF ELEMENTS
    final_df = pd.concat(output, ignore_index=True)

    return final_df
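
To see the difference concretely, a rough timing sketch on made-up data (the sizes are arbitrary, and DataFrame.append used here was removed in pandas 2.0, so this comparison assumes an older pandas, as the answer's code does):

import time
import pandas as pd

pieces = [pd.DataFrame({'x': range(1000)}) for _ in range(500)]

start = time.perf_counter()
grown = pd.DataFrame()
for piece in pieces:
    grown = grown.append(piece, ignore_index=True)   # re-copies all rows so far
print('append in loop:', time.perf_counter() - start)

start = time.perf_counter()
collected = []
for piece in pieces:
    collected.append(piece)                          # O(1), no row copying
result = pd.concat(collected, ignore_index=True)     # one copy at the end
print('list + concat: ', time.perf_counter() - start)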





answered Apr 23 at 18:44 by Parfait