Non-normalized set difference algorithm
I'm trying to set up an algorithm in Python that returns all rows from one dataset (DataSet1) less any matching instances of rows in a second dataset (DataSet2).
Objective:
DataSet1:            DataSet2:
   A  B  C              A  B  C  D
1  6  5  1           1  4  4  3  1
2  4  4  3           2  4  4  3  1
3  4  4  3           3  6  5  3  1
4  4  4  3           4  5  3  1  1
5  3  2  3           5  3  2  3  1
DataSet1 - DataSet2 = ResultSet
ResultSet:
   A  B  C
1  6  5  1
2  4  4  3
Notice that the data has many repeated rows, and when the difference operation is applied, the count of duplicate instances in DataSet2 is subtracted from the count of duplicate instances in DataSet1.
The parameters of this exercise are:
- Extra columns in the subtrahend (DataSet2) must be ignored.
- Instances of a record in DataSet1 that also exist in DataSet2
must be removed from DataSet1 until either there are no
instances of the duplicate left in DataSet1 or there are no
instances left in DataSet2.
- In line with the above: if a certain
record is duplicated 3 times in DataSet1 and once in DataSet2, then
two of those duplicates should remain in DataSet1. If it's
the other way around, 1 - 3 = -2, so all duplicates of that record are
removed from the returned dataset.
- We must assume that the names
and number of columns, rows, and index positions are all unknown.
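For what it's worth, these rules amount to multiset difference; collections.Counter subtraction has the same keep-only-positive-counts behaviour. A quick sketch using the example rows above, keyed on columns A, B, C:

from collections import Counter

ds1 = Counter([(6, 5, 1), (4, 4, 3), (4, 4, 3), (4, 4, 3), (3, 2, 3)])
ds2 = Counter([(4, 4, 3), (4, 4, 3), (6, 5, 3), (5, 3, 1), (3, 2, 3)])
# Counter subtraction drops records whose count falls to zero or below.
print(list((ds1 - ds2).elements()))  # [(6, 5, 1), (4, 4, 3)]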
My Algorithm So Far:
import pandas as pd
import numpy as np

def __sub__(self, arg):
    """docstring"""
    # Create a variable that holds the column names of self. We
    # will use this as a filter and thus ignore any extra columns in arg.
    lstOutputColumns = self.columns.tolist()
    # Group data into normalized sets we can use to break the data
    # apart. These groups are returned using pd.DataFrame.size(), which
    # also gives me the count of times a record occurred in the
    # original data set (self & arg).
    dfGroupArg = arg.groupby(arg.columns.tolist(), as_index=False).size().reset_index()
    dfGroupSelf = self.groupby(lstOutputColumns, as_index=False).size().reset_index()
    # Merge the normalized data so as to get all the data in the
    # subtrahend set (DataSet2) that matches a record in DataSet1;
    # we can forget about the rest.
    dfMergedArg = dfGroupArg.merge(dfGroupSelf, how="inner", on=lstOutputColumns)
    # Add a calculated column to the merged subtrahend set to get the
    # difference between the counts that our groupby().size() appended
    # to each row two steps ago. This is all done using iloc so as to
    # avoid naming columns, since I can't guarantee any particular
    # column name isn't already in use.
    dfMergedArg = pd.concat(
        [dfMergedArg, pd.Series(dfMergedArg.iloc[:, -1] - dfMergedArg.iloc[:, -2])],
        axis=1)
    # The result of the last three steps is a DataFrame holding only
    # rows that exist in both sets, with the count of times each
    # particular row occurred and the difference between those counts.
    # It should end up so that the last three columns of the DataFrame
    # are (DataSet2ct), (DataSet1ct), (DataSet1ct - DataSet2ct).
    # Now we iterate through rows and construct a new data set based on
    # the difference in the last column.
    lstRows = []
    for index, row in dfMergedArg.iterrows():
        if row.iloc[-1] > 0:
            dictRow = {}
            dictRow.update(row)
            lstRows += [dictRow] * int(row.iloc[-1])
    # Create a new DataFrame from the rows we collected in lstRows.
    dfLessArg = pd.DataFrame(lstRows, columns=lstOutputColumns)
    # This next part is a simple left anti-join to get the rest of the
    # data out of DataSet1 that is unaffected by DataSet2.
    dfMergedSelf = self.merge(dfGroupArg, how="left", on=lstOutputColumns)
    dfMergedSelf = dfMergedSelf[dfMergedSelf[0] == np.nan]
    # Now we put both data sets back together in a single DataFrame.
    dfCombined = dfMergedSelf.append(dfLessArg).reset_index()
    # Return the result.
    return dfCombined[lstOutputColumns]
This works; the reason I've posted it here is that it's not very efficient. The creation of multiple DataFrames during a run makes it a memory hog. Also, the use of iterrows() feels like a last resort that inevitably results in slow execution. I think the problem is interesting, though, because it's about dealing with really un-ideal data situations that (let's face it) occur all the time.
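For the anti-join step specifically, I know merge's indicator flag is another way to express it; a minimal sketch with toy frames (illustrative names, not the routine above):

import pandas as pd

left = pd.DataFrame({'A': [1, 2, 3], 'B': [9, 8, 7]})
right = pd.DataFrame({'A': [2], 'B': [8]})
# indicator=True adds a '_merge' column flagging 'left_only' / 'both' / 'right_only'.
both = left.merge(right, how='left', on=['A', 'B'], indicator=True)
anti = both[both['_merge'] == 'left_only'].drop(columns='_merge')
print(anti)  # rows of `left` with no match in `right`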
Alright StackExchange - please rip me apart now!
python python-3.x pandas
asked Jul 31 at 0:04 by Jamie Marshall
Any reason for naming it __sub__? Or is it a method inside some class? – hjpotter92, Jul 31 at 13:10

It's named __sub__ because I overrode the '-' operator in the class where I'm implementing it. The algorithm is what's important, though; if the algorithm's good, I could implement it anywhere. – Jamie Marshall, Jul 31 at 16:20
1 Answer
You can remove the concatenation and the manual iteration over iterrows by using pandas.Index.repeat, which uses numpy.repeat under the hood. You can feed this function a single int, and each index will be repeated that many times; or an array of ints, and each index will be repeated as many times as the corresponding entry in the array.
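As a quick illustration of that repeat behaviour (toy data, not from the question):

import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'c']})
# Repeat each index label according to the matching entry in the array,
# then use .loc to materialise the duplicated rows.
print(df.loc[df.index.repeat([2, 0, 1])])
#    x
# 0  a
# 0  a
# 2  c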
Combine that with filtering out negative values and accessing elements by index using pandas.DataFrame.loc, and you can end up with:

dfMergedArg = dfGroupArg.merge(dfGroupSelf, how='inner', on=lstOutputColumns)
dfNeededRepetitions = dfMergedArg.iloc[:, -1] - dfMergedArg.iloc[:, -2]
dfNeededRepetitions[dfNeededRepetitions < 0] = 0
dfLessArg = dfMergedArg.loc[dfMergedArg.index.repeat(dfNeededRepetitions)][lstOutputColumns]
Now the rest of the code would benefit a bit from PEP 8 naming style (lower_case_with_underscores for variable names) and from not prefixing variable names with their type (dfSomething, lstFoo…). Lastly, checking for NaNs should be done using np.isnan and not ==:
def __sub__(self, args):
    columns = self.columns.tolist()
    group_self = self.groupby(columns, as_index=False).size().reset_index()
    group_args = args.groupby(columns, as_index=False).size().reset_index()
    duplicated = group_args.merge(group_self, how='inner', on=columns)
    repetitions = duplicated.iloc[:, -1] - duplicated.iloc[:, -2]
    repetitions[repetitions < 0] = 0
    duplicates_remaining = duplicated.loc[duplicated.index.repeat(repetitions)][columns]
    uniques = self.merge(group_args, how='left', on=columns)
    uniques = uniques[np.isnan(uniques.iloc[:, -1])][columns]
    return uniques.append(duplicates_remaining).reset_index()
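The NaN remark is easy to verify with a one-liner check:

import numpy as np

print(np.nan == np.nan)  # False: NaN never compares equal, so `== np.nan` matches nothing
print(np.isnan(np.nan))  # True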
answered Jul 31 at 14:46 by Mathias Ettinger (edited Aug 1 at 8:11)