Regression on Pandas DataFrame

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
2
down vote

favorite

I am working on the following assignment and I am a bit lost:

Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.

This is what I currently have as code (I use mord's API for the regression model since Rating is categorical):

# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)

# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]

# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', 1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)

# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)

Could I have some constructive criticism on the above code please (regarding what I am missing from the posed task)?

This is what bow looks like:

enter image description here

If you need any other information just comment I will happily edit it in.

P.S: I've been working on this the past few days, this is my first regression model I code. I'll be honest, I didn't find it easy. So if someone could lend a helping hand I would be so ever grateful. Lastly, I want to be clear that I don't post to get a complete answer, code-wise (even if I would not say no to that). I am merely hoping in some guidance. My deadline approaches fast :P

edited Jul 15 at 2:16

200_success

123k14143399

asked Jul 14 at 15:15

Stefano Pozzi

112

The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â€“Â Mathias Ettinger
Jul 15 at 10:26

No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â€“Â Stefano Pozzi
Jul 15 at 11:32

add a commentÂ |Â

up vote
2
down vote

favorite

I am working on the following assignment and I am a bit lost:

Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.

This is what I currently have as code (I use mord's API for the regression model since Rating is categorical):

# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)

# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]

# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', 1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)

# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)

Could I have some constructive criticism on the above code please (regarding what I am missing from the posed task)?

This is what bow looks like:

enter image description here

If you need any other information just comment I will happily edit it in.

edited Jul 15 at 2:16

200_success

123k14143399

asked Jul 14 at 15:15

Stefano Pozzi

112

The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â€“Â Mathias Ettinger
Jul 15 at 10:26

No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â€“Â Stefano Pozzi
Jul 15 at 11:32

add a commentÂ |Â

up vote
2
down vote

favorite

I am working on the following assignment and I am a bit lost:

Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.

This is what I currently have as code (I use mord's API for the regression model since Rating is categorical):

# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)

# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]

# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', 1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)

# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)

Could I have some constructive criticism on the above code please (regarding what I am missing from the posed task)?

This is what bow looks like:

enter image description here

If you need any other information just comment I will happily edit it in.

edited Jul 15 at 2:16

200_success

123k14143399

asked Jul 14 at 15:15

Stefano Pozzi

112

I am working on the following assignment and I am a bit lost:

Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.

This is what I currently have as code (I use mord's API for the regression model since Rating is categorical):

# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)

# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]

# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', 1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)

# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)

Could I have some constructive criticism on the above code please (regarding what I am missing from the posed task)?

This is what bow looks like:

enter image description here

If you need any other information just comment I will happily edit it in.

edited Jul 15 at 2:16

200_success

123k14143399

asked Jul 14 at 15:15

Stefano Pozzi

112

edited Jul 15 at 2:16

200_success

123k14143399

edited Jul 15 at 2:16

200_success

123k14143399

edited Jul 15 at 2:16

200_success

123k14143399

asked Jul 14 at 15:15

Stefano Pozzi

112

asked Jul 14 at 15:15

Stefano Pozzi

112

asked Jul 14 at 15:15

Stefano Pozzi

112

The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â€“Â Mathias Ettinger
Jul 15 at 10:26

No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â€“Â Stefano Pozzi
Jul 15 at 11:32

add a commentÂ |Â

The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â€“Â Mathias Ettinger
Jul 15 at 10:26

No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â€“Â Stefano Pozzi
Jul 15 at 11:32

The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â€“Â Mathias Ettinger
Jul 15 at 10:26

No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â€“Â Stefano Pozzi
Jul 15 at 11:32

add a commentÂ |Â

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199491%2fregression-on-pandas-dataframe%23new-answer', 'question_page');

);

Post as a guest

Name

active

oldest

votes

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

trjhtr