Determine file provider based on column headers, replace columns and process data

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
2
down vote

favorite

Each day I receive files from different customers relaying the same information, but the customers use proprietary formats. For example,

Customer1Data = 'User':['BobJones','BobJones','BobJones'],
'UserSong':['Song1','Song2','Song3'],
'UserSplit':[0.50,0.25,0.50],
'UserEarnings':[100,200,300]

Customer2Data = 'Name':['BobJones','BobJones','BobJones'],
'SongName':['Song1','Song2','Song3'],
'SongSplit':[0.50,0.25,0.50],
'SongEarnings':[100,200,300]

I would like to be able to read in whatever CSV file is present in the download directory, replace the column names and then do summations based on replaced column names. Right now, I do this.

from collections import namedtuple
_customers = namedtuple('CustomerName',['statement_headers','first_column'])
Customer1 = _customers(statement_headers = 'User':'Name',
'UserSong':'Song',
'UserSplit':'Split',
'UserEarnings' :'Earnings',
first_column = 'User')

Customer2 = _customers(statement_headers = 'Name':'Name',
'SongName':'Song',
'SongSplit':'Split',
'SongEarnings':'Earnings',
first_column = 'Name')

I then create a dictionary with the first column from each customers csv file:

_customertypes = 'Customer1':Customer1,'Customer2':Customer2
def _create_first_col_dictionary():
 first_col = 
 for k,v in _customertypes.items():
 first_col[v.first_column] = k
 return first_col

Create a dictionary to hold the csv based on the file type, read in the header row and assign to csv_dict

import pandas as pd
csv_dict = k: for k in _customertypes.keys()
files = [r'C:locationcsv1.csv',r'C:locationcsv2.csv',r'C:locationcsv3.csv']
for file in files:
 x = pd.read_csv(file,nrows = 2,encoding = 'LATIN-1')
 col = x.columns.tolist()[0]
 file_type = _first_col[col]
 csv_dict[file_type].append(file)

Now I am left with this:

csv_dict = 'Customer1':[r'C:locationcsv1.csv',r'C:locationcsv3.csv'],
'Customer2':[r'C:locationcsv2.csv']

Finally, I merge the like csv files together for Customer1 and Customer2, then replace column names using the dictionary above and do analysis.

I am curious if anyone has better suggestions on how to do something like this. I think it could require a level of maintenance (adding new columns to the dictionaries, adding the first_column name, etc).

edited Jun 5 at 3:01

Billal BEGUERADJ

asked May 14 at 15:55

chris dorn

12815

add a commentÂ |Â

up vote
2
down vote

favorite

Each day I receive files from different customers relaying the same information, but the customers use proprietary formats. For example,

Customer1Data = 'User':['BobJones','BobJones','BobJones'],
'UserSong':['Song1','Song2','Song3'],
'UserSplit':[0.50,0.25,0.50],
'UserEarnings':[100,200,300]

Customer2Data = 'Name':['BobJones','BobJones','BobJones'],
'SongName':['Song1','Song2','Song3'],
'SongSplit':[0.50,0.25,0.50],
'SongEarnings':[100,200,300]

I would like to be able to read in whatever CSV file is present in the download directory, replace the column names and then do summations based on replaced column names. Right now, I do this.

from collections import namedtuple
_customers = namedtuple('CustomerName',['statement_headers','first_column'])
Customer1 = _customers(statement_headers = 'User':'Name',
'UserSong':'Song',
'UserSplit':'Split',
'UserEarnings' :'Earnings',
first_column = 'User')

Customer2 = _customers(statement_headers = 'Name':'Name',
'SongName':'Song',
'SongSplit':'Split',
'SongEarnings':'Earnings',
first_column = 'Name')

I then create a dictionary with the first column from each customers csv file:

_customertypes = 'Customer1':Customer1,'Customer2':Customer2
def _create_first_col_dictionary():
 first_col = 
 for k,v in _customertypes.items():
 first_col[v.first_column] = k
 return first_col

Create a dictionary to hold the csv based on the file type, read in the header row and assign to csv_dict

import pandas as pd
csv_dict = k: for k in _customertypes.keys()
files = [r'C:locationcsv1.csv',r'C:locationcsv2.csv',r'C:locationcsv3.csv']
for file in files:
 x = pd.read_csv(file,nrows = 2,encoding = 'LATIN-1')
 col = x.columns.tolist()[0]
 file_type = _first_col[col]
 csv_dict[file_type].append(file)

Now I am left with this:

csv_dict = 'Customer1':[r'C:locationcsv1.csv',r'C:locationcsv3.csv'],
'Customer2':[r'C:locationcsv2.csv']

Finally, I merge the like csv files together for Customer1 and Customer2, then replace column names using the dictionary above and do analysis.

edited Jun 5 at 3:01

Billal BEGUERADJ

asked May 14 at 15:55

chris dorn

12815

add a commentÂ |Â

up vote
2
down vote

favorite

Each day I receive files from different customers relaying the same information, but the customers use proprietary formats. For example,

Customer1Data = 'User':['BobJones','BobJones','BobJones'],
'UserSong':['Song1','Song2','Song3'],
'UserSplit':[0.50,0.25,0.50],
'UserEarnings':[100,200,300]

Customer2Data = 'Name':['BobJones','BobJones','BobJones'],
'SongName':['Song1','Song2','Song3'],
'SongSplit':[0.50,0.25,0.50],
'SongEarnings':[100,200,300]

I would like to be able to read in whatever CSV file is present in the download directory, replace the column names and then do summations based on replaced column names. Right now, I do this.

from collections import namedtuple
_customers = namedtuple('CustomerName',['statement_headers','first_column'])
Customer1 = _customers(statement_headers = 'User':'Name',
'UserSong':'Song',
'UserSplit':'Split',
'UserEarnings' :'Earnings',
first_column = 'User')

Customer2 = _customers(statement_headers = 'Name':'Name',
'SongName':'Song',
'SongSplit':'Split',
'SongEarnings':'Earnings',
first_column = 'Name')

I then create a dictionary with the first column from each customers csv file:

_customertypes = 'Customer1':Customer1,'Customer2':Customer2
def _create_first_col_dictionary():
 first_col = 
 for k,v in _customertypes.items():
 first_col[v.first_column] = k
 return first_col

Create a dictionary to hold the csv based on the file type, read in the header row and assign to csv_dict

import pandas as pd
csv_dict = k: for k in _customertypes.keys()
files = [r'C:locationcsv1.csv',r'C:locationcsv2.csv',r'C:locationcsv3.csv']
for file in files:
 x = pd.read_csv(file,nrows = 2,encoding = 'LATIN-1')
 col = x.columns.tolist()[0]
 file_type = _first_col[col]
 csv_dict[file_type].append(file)

Now I am left with this:

csv_dict = 'Customer1':[r'C:locationcsv1.csv',r'C:locationcsv3.csv'],
'Customer2':[r'C:locationcsv2.csv']

Finally, I merge the like csv files together for Customer1 and Customer2, then replace column names using the dictionary above and do analysis.

edited Jun 5 at 3:01

Billal BEGUERADJ

asked May 14 at 15:55

chris dorn

12815

Each day I receive files from different customers relaying the same information, but the customers use proprietary formats. For example,

Customer1Data = 'User':['BobJones','BobJones','BobJones'],
'UserSong':['Song1','Song2','Song3'],
'UserSplit':[0.50,0.25,0.50],
'UserEarnings':[100,200,300]

Customer2Data = 'Name':['BobJones','BobJones','BobJones'],
'SongName':['Song1','Song2','Song3'],
'SongSplit':[0.50,0.25,0.50],
'SongEarnings':[100,200,300]

I would like to be able to read in whatever CSV file is present in the download directory, replace the column names and then do summations based on replaced column names. Right now, I do this.

from collections import namedtuple
_customers = namedtuple('CustomerName',['statement_headers','first_column'])
Customer1 = _customers(statement_headers = 'User':'Name',
'UserSong':'Song',
'UserSplit':'Split',
'UserEarnings' :'Earnings',
first_column = 'User')

Customer2 = _customers(statement_headers = 'Name':'Name',
'SongName':'Song',
'SongSplit':'Split',
'SongEarnings':'Earnings',
first_column = 'Name')

I then create a dictionary with the first column from each customers csv file:

_customertypes = 'Customer1':Customer1,'Customer2':Customer2
def _create_first_col_dictionary():
 first_col = 
 for k,v in _customertypes.items():
 first_col[v.first_column] = k
 return first_col

Create a dictionary to hold the csv based on the file type, read in the header row and assign to csv_dict

import pandas as pd
csv_dict = k: for k in _customertypes.keys()
files = [r'C:locationcsv1.csv',r'C:locationcsv2.csv',r'C:locationcsv3.csv']
for file in files:
 x = pd.read_csv(file,nrows = 2,encoding = 'LATIN-1')
 col = x.columns.tolist()[0]
 file_type = _first_col[col]
 csv_dict[file_type].append(file)

Now I am left with this:

csv_dict = 'Customer1':[r'C:locationcsv1.csv',r'C:locationcsv3.csv'],
'Customer2':[r'C:locationcsv2.csv']

Finally, I merge the like csv files together for Customer1 and Customer2, then replace column names using the dictionary above and do analysis.

edited Jun 5 at 3:01

Billal BEGUERADJ

asked May 14 at 15:55

chris dorn

12815

edited Jun 5 at 3:01

Billal BEGUERADJ

edited Jun 5 at 3:01

Billal BEGUERADJ

edited Jun 5 at 3:01

Billal BEGUERADJ

asked May 14 at 15:55

chris dorn

12815

asked May 14 at 15:55

chris dorn

12815

asked May 14 at 15:55

chris dorn

12815

add a commentÂ |Â

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f194369%2fdetermine-file-provider-based-on-column-headers-replace-columns-and-process-dat%23new-answer', 'question_page');

);

Post as a guest

Name

active

oldest

votes

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

hRo55Cdoo29tdlGj8,AhKmDgcIWbbF5rKLQwPhomMyKEmjHoM7 KC8clP39ldyZkkxWxjWWl,gKuad6Ca2H33J18fukz J

搜尋此網誌

trjhtr