Regression on Pandas DataFrame
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
2
down vote
favorite
I am working on the following assignment and I am a bit lost:
Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.
This is what I currently have as code (I use mord
's API for the regression model since Rating
is categorical):
# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)
# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]
# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', 1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)
# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)
Could I have some constructive criticism on the above code please (regarding what I am missing from the posed task)?
This is what bow
looks like:
If you need any other information just comment I will happily edit it in.
P.S: I've been working on this the past few days, this is my first regression model I code. I'll be honest, I didn't find it easy. So if someone could lend a helping hand I would be so ever grateful. Lastly, I want to be clear that I don't post to get a complete answer, code-wise (even if I would not say no to that). I am merely hoping in some guidance. My deadline approaches fast :P
python homework statistics pandas clustering
add a comment |Â
up vote
2
down vote
favorite
I am working on the following assignment and I am a bit lost:
Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.
This is what I currently have as code (I use mord
's API for the regression model since Rating
is categorical):
# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)
# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]
# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', 1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)
# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)
Could I have some constructive criticism on the above code please (regarding what I am missing from the posed task)?
This is what bow
looks like:
If you need any other information just comment I will happily edit it in.
P.S: I've been working on this the past few days, this is my first regression model I code. I'll be honest, I didn't find it easy. So if someone could lend a helping hand I would be so ever grateful. Lastly, I want to be clear that I don't post to get a complete answer, code-wise (even if I would not say no to that). I am merely hoping in some guidance. My deadline approaches fast :P
python homework statistics pandas clustering
The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â Mathias Ettinger
Jul 15 at 10:26
No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â Stefano Pozzi
Jul 15 at 11:32
add a comment |Â
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I am working on the following assignment and I am a bit lost:
Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.
This is what I currently have as code (I use mord
's API for the regression model since Rating
is categorical):
# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)
# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]
# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', 1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)
# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)
Could I have some constructive criticism on the above code please (regarding what I am missing from the posed task)?
This is what bow
looks like:
If you need any other information just comment I will happily edit it in.
P.S: I've been working on this the past few days, this is my first regression model I code. I'll be honest, I didn't find it easy. So if someone could lend a helping hand I would be so ever grateful. Lastly, I want to be clear that I don't post to get a complete answer, code-wise (even if I would not say no to that). I am merely hoping in some guidance. My deadline approaches fast :P
python homework statistics pandas clustering
I am working on the following assignment and I am a bit lost:
Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating. You will be judged
on the process of building the model (regularization, subset
selection, validation set, etc.) and not so much on the accuracy of
the results.
This is what I currently have as code (I use mord
's API for the regression model since Rating
is categorical):
# Create word matrix
bow = df.Review2.str.split().apply(pd.Series.value_counts)
rating = df['Rating']
df_rating = pd.DataFrame([rating])
df_rating = df_rating.transpose()
bow = bow.join(df_rating)
# Remove some columns and rows
bow = bow.loc[(bow['Rating'].notna()), ~(bow.sum(0) < 80)]
# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', 1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)
# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# scores -> array([0.75438596, 0.73684211, 0.66071429, 0.53571429, 0.60714286])
# avg_score -> Accuracy: 0.66 (+/- 0.16)
Could I have some constructive criticism on the above code please (regarding what I am missing from the posed task)?
This is what bow
looks like:
If you need any other information just comment I will happily edit it in.
P.S: I've been working on this the past few days, this is my first regression model I code. I'll be honest, I didn't find it easy. So if someone could lend a helping hand I would be so ever grateful. Lastly, I want to be clear that I don't post to get a complete answer, code-wise (even if I would not say no to that). I am merely hoping in some guidance. My deadline approaches fast :P
python homework statistics pandas clustering
edited Jul 15 at 2:16
200_success
123k14143399
123k14143399
asked Jul 14 at 15:15
Stefano Pozzi
112
112
The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â Mathias Ettinger
Jul 15 at 10:26
No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â Stefano Pozzi
Jul 15 at 11:32
add a comment |Â
The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â Mathias Ettinger
Jul 15 at 10:26
No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â Stefano Pozzi
Jul 15 at 11:32
The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â Mathias Ettinger
Jul 15 at 10:26
The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â Mathias Ettinger
Jul 15 at 10:26
No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â Stefano Pozzi
Jul 15 at 11:32
No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â Stefano Pozzi
Jul 15 at 11:32
add a comment |Â
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f199491%2fregression-on-pandas-dataframe%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
The wording of the question feels like the code does not work as expected and you're asking about implementing missing features. Do I understand correctly?
â Mathias Ettinger
Jul 15 at 10:26
No, I believe that the code works correctly but I am wondering if I "hit all the targets".
â Stefano Pozzi
Jul 15 at 11:32