Non normalized set difference algorithm

I'm trying to write an algorithm in Python that returns all rows of one data set (DataSet1) less any instances of those rows in a second data set (DataSet2).



Objective:



DataSet1:          DataSet2:
   A  B  C            A  B  C  D
1  6  5  1         1  4  4  3  1
2  4  4  3         2  4  4  3  1
3  4  4  3         3  6  5  3  1
4  4  4  3         4  5  3  1  1
5  3  2  3         5  3  2  3  1

DataSet1 - DataSet2 = ResultSet

ResultSet:
   A  B  C
1  6  5  1
2  4  4  3


Notice that the data has many repeated rows, and that when the difference operation is applied, the number of instances of a row in DataSet2 is subtracted from the number of instances of that row in DataSet1.



The parameters of this exercise are as follows:

  1. Extra columns in the subtrahend (DataSet2) must be ignored.

  2. Instances of a record in DataSet1 that also exist in DataSet2
    must be removed from DataSet1 until either there are no
    instances of the duplicate left in DataSet1 or there are no
    instances left in DataSet2.

  3. In line with the above, if a certain
    record is duplicated 3 times in DataSet1 and once in DataSet2, then
    two of those duplicates should remain in DataSet1. If it's
    the other way around, 1 - 3 = -2, so all duplicates of that record are
    removed from the returned DataSet.

  4. We must assume that the names
    and number of columns, rows, and index positions are all unknown.
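The counting rules above amount to a multiset (bag) difference. As a minimal sketch of the intended semantics only, not of the pandas implementation, Python's collections.Counter behaves this way under subtraction. The tuples below are the example rows restricted to columns A, B and C; the variable names are illustrative.

```python
from collections import Counter

# Rows from the example above, restricted to DataSet1's columns (A, B, C);
# DataSet2's extra column D is dropped before counting (rule 1).
data_set_1 = [(6, 5, 1), (4, 4, 3), (4, 4, 3), (4, 4, 3), (3, 2, 3)]
data_set_2 = [(4, 4, 3), (4, 4, 3), (6, 5, 3), (5, 3, 1), (3, 2, 3)]

# Counter subtraction keeps only positive counts, which matches rule 3:
# 3 - 1 leaves 2 copies, while 1 - 3 leaves none.
remaining = Counter(data_set_1) - Counter(data_set_2)

# Expand the counts back into individual rows.
result = sorted(remaining.elements())
print(result)  # [(4, 4, 3), (6, 5, 1)]
```

This reproduces the two rows of ResultSet (in sorted order), which can serve as a reference when checking a DataFrame-based version.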



My Algorithm So Far:



import pandas as pd
import numpy as np
import copy

def __sub__(self, arg):
    """docstring"""

    # Create a variable that holds the column names of self. We
    # will use this as a filter and thus ignore any extra columns in arg.
    lstOutputColumns = self.columns.tolist()

    # Group data into normalized sets we can use to break the data
    # apart. These groups are returned using groupby(...).size(), which
    # also gives me the count of times a record occurred in the
    # original data set (self & arg).
    dfGroupArg = arg.groupby(arg.columns.tolist(),as_index=False).size().reset_index()
    dfGroupSelf = self.groupby(lstOutputColumns,as_index=False).size().reset_index()

    # Merge the normalized data so as to get all the data in the
    # subtrahend set (DataSet2) that matches a record in DataSet1;
    # we can forget about the rest.
    dfMergedArg = dfGroupArg.merge(dfGroupSelf, how="inner", on=lstOutputColumns)

    # Add a calculated column to the merged subtrahend set to get the
    # difference between the counts that our groupby.size() appended
    # to each row two steps ago. This is all done using iloc so as to
    # avoid naming columns, since I can't guarantee any particular column
    # name isn't already in use.
    dfMergedArg = pd.concat([dfMergedArg, pd.Series(dfMergedArg.iloc[:,-1] - dfMergedArg.iloc[:,-2])], axis=1)

    # The result of the last three steps is a DataFrame with only
    # rows that exist in both sets, with the count of the times each
    # particular row exists on the far right of the table along with the
    # difference between those counts. It should end up so that the
    # last three columns of the DataFrame are
    # (DataSet2ct),(DataSet1ct),(DataSet1ct-DataSet2ct)
    # Now we iterate through rows and construct a new data set based on
    # the difference in the last column.
    lstRows = []
    for index, row in dfMergedArg.iterrows():
        if row.iloc[-1] > 0:
            dictRow = {}
            dictRow.update(row)
            lstRows += [dictRow] * row.iloc[-1]

    # Create a new DataFrame with the rows we collected in lstRows.
    dfLessArg = pd.DataFrame(lstRows, columns=lstOutputColumns)

    # This next part is a simple left anti-join to get the rest of the
    # data out of DataSet1 that is unaffected by DataSet2.
    dfMergedSelf = self.DataFrameIns.merge(dfGroupArg, how="left", on=lstOutputColumns)
    dfMergedSelf = dfMergedSelf[dfMergedSelf[0] == np.nan]

    # Now we put both data sets back together in a single DataFrame.
    dfCombined = dfMergedSelf.append(dfLessArg).reset_index()

    # Return the result
    return dfCombined[lstOutputColumns]


This works; the reason I've posted it here is that it's not very efficient. Creating multiple DataFrames during a run makes it a memory hog. Also, using iterrows() feels like a last resort that inevitably results in slow execution. I think the problem is interesting, though, because it's about dealing with really un-ideal data situations that (let's face it) occur all the time.



Alright StackExchange - please rip me apart now!







  • Any reason for naming it __sub__? Or is it a method inside some class?
    – hjpotter92
    Jul 31 at 13:10

  • It's named __sub__ because I overloaded the '-' operator in a class where I'm implementing it. The algorithm is what's important, though. If the algorithm's good I could implement it anywhere.
    – Jamie Marshall
    Jul 31 at 16:20

















asked Jul 31 at 0:04 by Jamie Marshall






















1 Answer













You can remove the concatenation and the manual iteration over iterrows() by using pandas.Index.repeat, which uses numpy.repeat under the hood. You can feed this function an int, and each index entry will be repeated that many times; or an array of ints, and each entry will be repeated as many times as the corresponding entry in the array.
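As a small illustration of the two calling styles just described (the labels and counts here are made up for the example):

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

# A single int repeats every label the same number of times.
twice = index.repeat(2)

# An array of ints repeats each label by its matching entry;
# a count of 0 drops the label entirely.
per_label = index.repeat([2, 0, 1])

print(list(twice))      # ['a', 'a', 'b', 'b', 'c', 'c']
print(list(per_label))  # ['a', 'a', 'c']
```

The zero case is what makes clamping negative differences to 0 sufficient: records with nothing left over simply vanish from the repeated index.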



Combine that with filtering negative values and accessing elements by index using pandas.DataFrame.loc and you can end up with:



dfMergedArg = dfGroupArg.merge(dfGroupSelf, how='inner', on=lstOutputColumns)
dfNeededRepetitions = dfMergedArg.iloc[:, -1] - dfMergedArg.iloc[:, -2]
dfNeededRepetitions[dfNeededRepetitions < 0] = 0
dfLessArg = dfMergedArg.loc[dfMergedArg.index.repeat(dfNeededRepetitions)][lstOutputColumns]


The rest of the code would benefit from following PEP 8: use lower_case_with_underscores for variable names, and don't prefix variable names with their type (dfSomething, lstFoo…). Lastly, checking for NaNs should be done using np.isnan and not ==, since NaN never compares equal to anything, itself included:



def __sub__(self, args):
    columns = self.columns.tolist()
    # Count how many times each record occurs in either data set
    group_self = self.groupby(columns, as_index=False).size().reset_index()
    group_args = args.groupby(columns, as_index=False).size().reset_index()

    # Records present in both sets: keep (count in self - count in args) copies
    duplicated = group_args.merge(group_self, how='inner', on=columns)
    repetitions = duplicated.iloc[:, -1] - duplicated.iloc[:, -2]
    repetitions[repetitions < 0] = 0
    duplicates_remaining = duplicated.loc[duplicated.index.repeat(repetitions)][columns]

    # Records of self that never occur in args (left anti-join)
    uniques = self.DataFrameIns.merge(group_args, how='left', on=columns)
    uniques = uniques[np.isnan(uniques.iloc[:, -1])][columns]
    return uniques.append(duplicates_remaining).reset_index()
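For readers on recent pandas releases, where DataFrame.append has been removed, here is a self-contained sketch of the same bag difference written as a plain function. It uses a different technique from the code above: each repeat of a record is numbered with groupby(...).cumcount(), and a left merge with indicator=True then acts as an anti-join. The function name multiset_difference and the helper column _occurrence are assumptions made for this example (the helper name would have to be changed if it clashed with a real column), and the sketch assumes the key columns contain no missing values.

```python
import pandas as pd


def multiset_difference(left, right):
    """Remove from `left` one row for every matching row in `right`."""
    columns = left.columns.tolist()

    # Number the repeats of each record: 0 for its first copy, 1 for the next...
    left_numbered = left.assign(_occurrence=left.groupby(columns).cumcount())
    # Extra columns of `right` are dropped here, so they are ignored.
    right_numbered = right[columns].assign(
        _occurrence=right.groupby(columns).cumcount()
    )

    # The n-th copy in `left` is cancelled only by the n-th copy in `right`.
    merged = left_numbered.merge(
        right_numbered, how="left", on=columns + ["_occurrence"], indicator=True
    )
    kept = merged.loc[merged["_merge"] == "left_only", columns]
    return kept.reset_index(drop=True)


data_set_1 = pd.DataFrame(
    {"A": [6, 4, 4, 4, 3], "B": [5, 4, 4, 4, 2], "C": [1, 3, 3, 3, 3]}
)
data_set_2 = pd.DataFrame(
    {"A": [4, 4, 6, 5, 3], "B": [4, 4, 5, 3, 2], "C": [3, 3, 3, 1, 3], "D": [1] * 5}
)

result = multiset_difference(data_set_1, data_set_2)
print(result)
```

On the example data from the question this leaves the rows (6, 5, 1) and (4, 4, 3), in the order they appear in DataSet1. It avoids both iterrows() and the repeat step, at the cost of one temporary column on each side.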





answered Jul 31 at 14:46 by Mathias Ettinger (edited Aug 1 at 8:11)






















             
