Count number of registers in interval & location

Recently I asked how to count the number of registers within a time interval, which was answered in https://stackoverflow.com/questions/49240140/count-number-of-registers-in-interval.



That solution works well, but I had to adapt it to also take a location key into account.



I did that with the following code:



import numpy as np
import pandas as pd

def time_features(df, time_key, T, location_key, output_key):
    """
    Create features based on time such as: how many BDs are open
    in the same GRA at this moment (hour)?
    """
    assert np.issubdtype(df[time_key].dtype, np.datetime64)
    output = pd.DataFrame()

    grouped = df.groupby(location_key)
    for name, group in grouped:
        # mark the time a register opens with +1 and the time it closes with -1
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1

        aux = group.copy()
        all_times = start_times.copy()
        aux[output_key] = -1
        all_times = all_times.append(aux, ignore_index=True)

        # sort by time and perform a cumulative sum to get opened registers
        # (subtract 1 since you don't want to include the current time as opened)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1

        # revert the index back to original order, and truncate closed times
        all_times = all_times.sort_index().iloc[:len(all_times) // 2]
        output = output.append(all_times, ignore_index=True)
    return output
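
To see the counting trick in isolation: each register contributes +1 at (its time - T hours) and -1 at its own time; sorting all events by time and taking a cumulative sum then tells each register how many others fell in the preceding T hours. A toy sketch with made-up timestamps (not from the question):

import pandas as pd

# toy illustration of the +1/-1 cumulative-sum trick with T = 2 hours;
# the three timestamps are invented for illustration
times = pd.to_datetime(['2013-01-01 09:00',
                        '2013-01-01 10:30',
                        '2013-01-01 11:15'])
T = 2

starts = pd.DataFrame({'time': times - pd.Timedelta(hours=T), 'delta': 1})
ends = pd.DataFrame({'time': times, 'delta': -1})
events = pd.concat([starts, ends]).sort_values('time')

# the running sum counts how many windows are open at each event;
# subtract 1 so a register does not count itself
events['open'] = events['delta'].cumsum() - 1
print(events)
# the rows that came from `starts` carry the counts: 0 for 09:00,
# 1 for 10:30 (09:00 is within 2h), 1 for 11:15 (10:30 is within 2h)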


Example input:



                  time  loc1 loc2
0  2013-01-01 12:56:00     1  "a"
1  2013-01-01 12:00:12     1  "b"
2  2013-01-01 10:34:28     2  "c"
3  2013-01-01 09:34:54     2  "c"
4  2013-01-01 08:34:55     3  "d"
5  2013-01-01 08:34:55     5  "d"
6  2013-01-01 16:35:19     4  "e"
7  2013-01-01 16:35:30     4  "e"

time_features(df, time_key='time', T=2, location_key='loc1', output_key='count')
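
For anyone who wants to reproduce this, a minimal sketch that rebuilds the sample dataframe above (the column names match the example; the construction itself is mine, not from the original post):

import pandas as pd

# rebuild the example input shown above
df = pd.DataFrame({
    'time': pd.to_datetime(['2013-01-01 12:56:00', '2013-01-01 12:00:12',
                            '2013-01-01 10:34:28', '2013-01-01 09:34:54',
                            '2013-01-01 08:34:55', '2013-01-01 08:34:55',
                            '2013-01-01 16:35:19', '2013-01-01 16:35:30']),
    'loc1': [1, 1, 2, 2, 3, 5, 4, 4],
    'loc2': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'e'],
})

counts = time_features(df, time_key='time', T=2, location_key='loc1', output_key='count')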


This works well for small data, but for larger data (I'm using it on a file with 1 million rows) it takes "forever" to run. I wonder if this computation could be optimized somehow.







asked Mar 20 at 16:21 by pceccon, edited Mar 20 at 17:08 by Dannnno
  • Is the localization key about time zone? If so it's probably better to just change the time zones according to that, before getting into a loop. Pytz would be a good place to start for that.
    – RCA
    Mar 29 at 18:21
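
If the key really is a time zone, a minimal sketch of the commenter's suggestion, assuming a hypothetical tz column holding IANA zone names (modern pandas handles this directly, with pytz doing the work underneath):

import pandas as pd

# hypothetical: a 'tz' column with IANA zone names, e.g. 'America/Sao_Paulo';
# localize each timestamp to its own zone and convert everything to UTC once,
# before any grouping, so all comparisons happen on a common clock
def normalize_to_utc(df, time_key='time', tz_key='tz'):
    out = df.copy()
    out[time_key] = pd.to_datetime([
        t.tz_localize(z).tz_convert('UTC')
        for t, z in zip(df[time_key], df[tz_key])
    ])
    return out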
















1 Answer (accepted)
Consider not growing a dataframe inside the for loop; instead, build a list (or dictionary) of dataframes and concatenate them all at once outside the loop.



Growing an object inside a loop forces repeated memory allocation, and the data accumulated so far is copied on every iteration. A single concatenation call outside the loop should run substantially faster, since virtually no intermediate copying is done.



Specifically change:



output = pd.DataFrame()


To a list:



output = []


Then append to the list inside the loop and call pd.concat(list) once outside the loop.



import numpy as np
import pandas as pd

def time_features(df, time_key, T, location_key, output_key):
    """
    Create features based on time such as: how many BDs are open
    in the same GRA at this moment (hour)?
    """
    assert np.issubdtype(df[time_key].dtype, np.datetime64)
    output = []

    grouped = df.groupby(location_key)
    for name, group in grouped:
        # mark the time a register opens with +1 and the time it closes with -1
        start_times = group.copy()
        start_times[time_key] = group[time_key] - pd.Timedelta(hours=T)
        start_times[output_key] = 1

        aux = group.copy()
        all_times = start_times.copy()
        aux[output_key] = -1
        all_times = all_times.append(aux, ignore_index=True)

        # sort by time and perform a cumulative sum to get opened registers
        # (subtract 1 since you don't want to include the current time as opened)
        all_times = all_times.sort_values(by=time_key)
        all_times[output_key] = all_times[output_key].cumsum() - 1

        # revert the index back to original order, and truncate closed times
        all_times = all_times.sort_index().iloc[:len(all_times) // 2]
        # APPEND TO LIST
        output.append(all_times)

    # CONCATENATE ALL DF ELEMENTS
    final_df = pd.concat(output, ignore_index=True)

    return final_df
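
To see the difference concretely, a rough timing sketch on made-up data (the sizes are arbitrary, and DataFrame.append used here was removed in pandas 2.0, so this comparison assumes an older pandas, as the answer's code does):

import time
import pandas as pd

pieces = [pd.DataFrame({'x': range(1000)}) for _ in range(500)]

start = time.perf_counter()
grown = pd.DataFrame()
for piece in pieces:
    grown = grown.append(piece, ignore_index=True)   # re-copies all rows so far
print('append in loop:', time.perf_counter() - start)

start = time.perf_counter()
collected = []
for piece in pieces:
    collected.append(piece)                          # O(1), no row copying
result = pd.concat(collected, ignore_index=True)     # one copy at the end
print('list + concat: ', time.perf_counter() - start)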





answered Apr 23 at 18:44 by Parfait