Importing text into PANDAS and counting certain words
Aim: To improve the speed of the following code. The current run time is about 80 hours.
Purpose: The code imports a dataset containing 1.9 million rows and two columns. One of these columns contains text posts of variable length. I loop through the rows and pass each post to an imported function that returns a Counter of variable length. The Counter tells me about the presence of certain words in the text. On average the function takes less than 1 ms to return this Counter (a timing of one call to func is included at the end).
Overheads: The code I'm looking to improve is the loop. I accept a certain level of overhead from func, which can't be improved right now. I have considered using Spark or Dask to parallelize the loop and speed up the process. Suggestions are welcome.
# Import data
import pandas as pd
from func import func

data = pd.read_csv('Dataset.csv')
print(len(data))
>> 1900000
print(data.columns)
>> Index(['type', 'body'], dtype='object')

# Create new DF, one row of word counts per post
data2 = pd.DataFrame()
for post in data['post']:
    post = str(post)
    scores = func.countWords(post)
    data2 = data2.append(scores, ignore_index=True)

# Counter returned for the last post
print(scores)
>> Counter({0: 306,
            1: 185,
            2: 61,
            45: 31,
            87: 23,
            92: 5,
            94: 3,
            102: 30})

# Time a single call to func.countWords
import time
start = time.time()
score = func.countWords("Slow down Sir, you're going to give yourself skin faliure!")
end = time.time()
print(end - start)
>> 0.0019948482513427734
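A rough back-of-the-envelope check may help locate the bottleneck. This is only a sketch: it assumes the single timing above is representative of an average call, which one noisy sample cannot guarantee. Both input figures are taken from the post.

```python
# Estimate the total time spent inside countWords alone, using the
# row count and the single timing quoted in the question.
# Assumption: that one timing is close to the average per-call cost.
n_rows = 1_900_000
seconds_per_call = 0.0019948482513427734

total_seconds = n_rows * seconds_per_call
total_hours = total_seconds / 3600
print(f"{total_hours:.2f} hours in countWords")  # prints "1.05 hours in countWords"
```

If that estimate is anywhere near right, the function accounts for roughly one hour of the 80, which suggests most of the time goes elsewhere in the loop. A plausible candidate is the repeated data2.append, since each call returns a new frame and copies all the rows collected so far.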
python performance pandas
Any appreciable speedup here will come from optimizing countWords; I'd advise that you post that function for review.
– Zach Jul 31 at 16:56
edited Aug 1 at 4:10 by 200_success
asked Jul 31 at 13:08 by F.D
1 Answer
Several errors:
"I then loop through each of these loops"
I take it you mean "I then loop through each of these columns".
    for post in data['post]:
This is missing the closing quote mark.
    scores = Func.countWords(posts)
You imported func (lowercase) and now you're calling Func (uppercase).
    data2 = data2.append(scores,ignore_index=True)
append should take a row-like object. If the function returns a numeric value, you shouldn't be appending it. Instead you can do:
    def post_to_count(post):
        return func.countWords(str(post))

    scores = data['post'].apply(post_to_count)
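As a minimal sketch of how the result of .apply could be turned into one column per Counter key, assuming countWords returns a collections.Counter as in the question. The count_words function below is a hypothetical stand-in (it counts word lengths), since the real func module is not shown, and the three-row frame is made-up sample data.

```python
from collections import Counter

import pandas as pd


def count_words(text):
    """Hypothetical stand-in for func.countWords: counts word lengths."""
    return Counter(len(word) for word in text.split())


def post_to_count(post):
    # str() mirrors the question's handling of non-string values.
    return count_words(str(post))


# Made-up sample data; the real frame comes from Dataset.csv.
data = pd.DataFrame({"post": ["a bb cc", "dd eee", None]})

# One Counter per row, held in an object Series.
scores = data["post"].apply(post_to_count)

# Build the frame once from the list of Counters. Keys missing from a
# given Counter come out as NaN, so fill them with 0 before casting.
data2 = pd.DataFrame(scores.tolist(), index=scores.index).fillna(0).astype(int)
print(data2)
```

Building the frame in a single call avoids growing it row by row, which is where the original loop is likely to lose most of its time.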
I've fixed those spelling and syntax errors. Your .apply works fine, but 1) it returns a combined dataframe, and 2) does it increase the speed?
– F.D Jul 31 at 16:16
What about scores = data.post.str.apply(func.countWord)?
– Mathias Ettinger Jul 31 at 20:05
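One way to answer the speed question is to measure both variants on a small sample. The sketch below is illustrative only: count_words is a hypothetical stand-in for func.countWords, the 500 identical posts are made-up data, and pd.concat stands in for DataFrame.append, which pandas 2.0 removed. Timings will vary by machine, so none are quoted here.

```python
import time
from collections import Counter

import pandas as pd


def count_words(text):
    """Hypothetical stand-in for func.countWords: counts word lengths."""
    return Counter(len(word) for word in text.split())


# Made-up sample: 500 identical posts.
posts = pd.Series(["one two three four five"] * 500)

# Variant 1: grow the frame one row at a time, as the original loop does.
start = time.perf_counter()
grown = pd.DataFrame([count_words(posts.iloc[0])])
for post in posts.iloc[1:]:
    row = pd.DataFrame([count_words(post)])
    grown = pd.concat([grown, row], ignore_index=True)
grow_seconds = time.perf_counter() - start

# Variant 2: apply the function, then build the frame once.
start = time.perf_counter()
built = pd.DataFrame(posts.apply(count_words).tolist())
build_seconds = time.perf_counter() - start

print(f"row by row: {grow_seconds:.3f}s, built once: {build_seconds:.3f}s")
```

Both variants call the function once per row, so neither can be faster than the function itself; the difference comes from how the frame is assembled. Note also that, as far as I know, the .str accessor has no apply method, so Series.apply or Series.map on the column itself is the usual spelling.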
answered Jul 31 at 16:05 by Acccumulation