Non-normalized set difference algorithm

I'm trying to set up an algorithm in Python for getting all rows from one dataset (DataSet1) less any instances of rows in a second dataset (DataSet2).



Objective:



DataSet1:          DataSet2:

   A  B  C            A  B  C  D
1  6  5  1         1  4  4  3  1
2  4  4  3         2  4  4  3  1
3  4  4  3         3  6  5  3  1
4  4  4  3         4  5  3  1  1
5  3  2  3         5  3  2  3  1

DataSet1 - DataSet2 = ResultSet

ResultSet:

   A  B  C
1  6  5  1
2  4  4  3


Notice that the data has many repeated rows, and when the difference operation is applied, the number of duplicate instances of a record in DataSet2 is subtracted from the number of duplicate instances in DataSet1.



The parameters of this exercise are as follows:




  1. Extra columns in the subtrahend (DataSet2) must be ignored.

  2. Instances of a record in DataSet1 that also exist in DataSet2
     must be removed from DataSet1 until either there are no
     instances of the duplicate left in DataSet1 or there are no
     instances left in DataSet2.

  3. In line with the above, if a certain record is duplicated 3 times
     in DataSet1 and once in DataSet2, then two of those duplicates
     should remain in DataSet1. If it's the other way around, 1 - 3 = -2,
     so all duplicates of that record are removed from the returned
     DataSet (see the Counter sketch after this list).

  4. We must assume that the names and number of columns, rows, and
     index positions are all unknown.
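
To pin down what rules 2 and 3 mean, here is a tiny reference model of the intended multiset semantics using collections.Counter (this is only an illustration of the desired behavior, not the pandas algorithm below):

from collections import Counter

def multiset_difference(rows1, rows2):
    # Counter subtraction keeps only positive counts, which is exactly
    # rules 2 and 3: 3 - 1 leaves 2 copies, 1 - 3 leaves none.
    counts = Counter(rows1) - Counter(rows2)
    return [row for row, n in counts.items() for _ in range(n)]

ds1 = [(6, 5, 1), (4, 4, 3), (4, 4, 3), (4, 4, 3), (3, 2, 3)]
ds2 = [(4, 4, 3), (4, 4, 3), (6, 5, 3), (5, 3, 1), (3, 2, 3)]  # extra column D already dropped
print(multiset_difference(ds1, ds2))  # [(6, 5, 1), (4, 4, 3)]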



My Algorithm So Far:



import pandas as pd
import numpy as np


def __sub__(self, arg):
    """docstring"""

    # Create a variable that holds the column names of self. We
    # will use this as a filter and thus ignore any extra columns in arg.
    lstOutputColumns = self.columns.tolist()

    # Group data into normalized sets we can use to break the data
    # apart. These groups are returned using pd.DataFrame.size(), which
    # also gives me the count of times a record occurred in the
    # original data set (self & arg).
    dfGroupArg = arg.groupby(arg.columns.tolist(), as_index=False).size().reset_index()
    dfGroupSelf = self.groupby(lstOutputColumns, as_index=False).size().reset_index()

    # Merge the normalized data so as to get all the data in the
    # subtrahend set (DataSet2) that matches a record in DataSet1;
    # we can forget about the rest.
    dfMergedArg = dfGroupArg.merge(dfGroupSelf, how="inner", on=lstOutputColumns)

    # Add a calculated column to the merged subtrahend set to get the
    # difference between the counts that our groupby().size() appended
    # to each row two steps ago. This is all done using iloc so as to
    # avoid naming columns, since I can't guarantee any particular
    # column name isn't already in use.
    dfMergedArg = pd.concat([dfMergedArg, pd.Series(dfMergedArg.iloc[:, -1] - dfMergedArg.iloc[:, -2])], axis=1)

    # The result of the last three steps is a DataFrame with only
    # rows that exist in both sets, with the count of times each
    # particular row occurs on the far right of the table along with
    # the difference between those counts. It should end up so that
    # the last three columns of the DataFrame are
    # (DataSet2ct), (DataSet1ct), (DataSet1ct - DataSet2ct).
    # Now we iterate through rows and construct a new data set based
    # on the difference in the last column.
    lstRows = []
    for index, row in dfMergedArg.iterrows():
        if row.iloc[-1] > 0:
            dictRow = {}
            dictRow.update(row)
            lstRows += [dictRow] * int(row.iloc[-1])

    # Create a new DataFrame with the rows we collected in lstRows.
    dfLessArg = pd.DataFrame(lstRows, columns=lstOutputColumns)

    # This next part is a simple left anti-join to get the rest of the
    # data out of DataSet1 that is unaffected by DataSet2.
    dfMergedSelf = self.merge(dfGroupArg, how="left", on=lstOutputColumns)
    dfMergedSelf = dfMergedSelf[dfMergedSelf[0] == np.nan]

    # Now we put both data sets back together in a single DataFrame.
    dfCombined = dfMergedSelf.append(dfLessArg).reset_index()

    # Return the result
    return dfCombined[lstOutputColumns]
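
For context on the __sub__ name: the method is meant to live on a DataFrame subclass so that ds1 - ds2 runs the difference. A minimal sketch of that wiring (the SetFrame name and the _constructor override are my own additions, not part of the original class):

import pandas as pd

class SetFrame(pd.DataFrame):  # hypothetical wrapper class
    @property
    def _constructor(self):
        # Keep pandas operations returning SetFrame rather than DataFrame.
        return SetFrame

    def __sub__(self, arg):
        ...  # the algorithm above goes here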


This works; the reason I've posted it here is that it's not very efficient. The creation of multiple DataFrames during a run makes it a memory hog. Also, I feel the use of iterrows() is a last resort that inevitably results in slow execution. I think the problem is interesting, though, because it's about dealing with really un-ideal data situations that (let's face it) occur all the time.



Alright StackExchange - please rip me apart now!







  • Any reason for naming it __sub__? Or is it a method inside some class?
    – hjpotter92
    Jul 31 at 13:10










  • It's named __sub__ because I overrode the '-' operator in a class where I'm implementing it. The algorithm is what's important, though; if the algorithm's good, I can implement it anywhere.
    – Jamie Marshall
    Jul 31 at 16:20

asked Jul 31 at 0:04 by Jamie Marshall

1 Answer

You can remove the concatenation and the manual iteration with iterrows by using pandas.Index.repeat, which uses numpy.repeat under the hood. You can feed this function an int, and each index will be repeated that many times; or an array of ints, and each index will be repeated the number of times given by the corresponding entry of the array.
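
A quick demonstration of both forms (arbitrary sample values):

import pandas as pd

idx = pd.Index([10, 20, 30])
print(list(idx.repeat(2)))          # [10, 10, 20, 20, 30, 30]
print(list(idx.repeat([2, 0, 3])))  # [10, 10, 30, 30, 30]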



Combine that with filtering negative values and accessing elements by index using pandas.DataFrame.loc and you can end up with:



dfMergedArg = dfGroupArg.merge(dfGroupSelf, how='inner', on=lstOutputColumns)
dfNeededRepetitions = dfMergedArg.iloc[:, -1] - dfMergedArg.iloc[:, -2]
dfNeededRepetitions[dfNeededRepetitions < 0] = 0
dfLessArg = dfMergedArg.loc[dfMergedArg.index.repeat(dfNeededRepetitions)][lstOutputColumns]


Now the rest of the code would benefit a bit from PEP 8: naming style (lower_case_with_underscores for variable names) and not prefixing variable names with their type (dfSomething, lstFoo…). Lastly, checking for NaNs should be done using np.isnan, not ==:



def __sub__(self, args):
    columns = self.columns.tolist()
    group_self = self.groupby(columns, as_index=False).size().reset_index()
    group_args = args.groupby(columns, as_index=False).size().reset_index()

    duplicated = group_args.merge(group_self, how='inner', on=columns)
    repetitions = duplicated.iloc[:, -1] - duplicated.iloc[:, -2]
    repetitions[repetitions < 0] = 0
    duplicates_remaining = duplicated.loc[duplicated.index.repeat(repetitions)][columns]

    uniques = self.merge(group_args, how='left', on=columns)
    uniques = uniques[np.isnan(uniques.iloc[:, -1])][columns]
    return uniques.append(duplicates_remaining).reset_index()
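
To sanity-check the refactor against the question's example, you can call the method as a plain function on plain DataFrames (an assumption about how the class is wired, and assuming a pandas version contemporary with the post, where groupby(...).size() returns a Series and DataFrame.append still exists):

import numpy as np
import pandas as pd

ds1 = pd.DataFrame({'A': [6, 4, 4, 4, 3],
                    'B': [5, 4, 4, 4, 2],
                    'C': [1, 3, 3, 3, 3]})
ds2 = pd.DataFrame({'A': [4, 4, 6, 5, 3],
                    'B': [4, 4, 5, 3, 2],
                    'C': [3, 3, 3, 1, 3],
                    'D': [1, 1, 1, 1, 1]})

print(__sub__(ds1, ds2)[['A', 'B', 'C']])
#    A  B  C
# 0  6  5  1
# 1  4  4  3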





answered Jul 31 at 14:46 by Mathias Ettinger (edited Aug 1 at 8:11)
