Importing text into PANDAS and counting certain words

Aim: To improve the speed of the following code. The current runtime is about 80 hours.



Purpose: The code imports a dataset that contains 1.9 million rows and two columns. One of these columns contains text posts of variable length. I then loop through each of these rows and pass the post to an imported function that returns a Counter of variable length. The counter tells me about the presence of certain words in the text. On average the function takes less than 1 ms to return this counter. (A timing of one call is included at the end.)



Overheads: The code I'm looking to improve is the loop. I accept a certain level of overhead from the function, which can't be improved at the moment. I have considered using Spark or Dask to parallelize the loop and speed up the process. Suggestions are welcome.



# Import data
import pandas as pd
from func import func

data = pd.read_csv('Dataset.csv')

print(len(data))
>> 1900000

print(data.columns)
>> Index(['type', 'body'], dtype='object')


# Create new DF
data2 = pd.DataFrame()

# Score every post and append the resulting counter as a new row
for post in data['post']:
    post = str(post)
    scores = func.countWords(post)
    data2 = data2.append(scores, ignore_index=True)

print(scores)
>> Counter({0: 306,
            1: 185,
            2: 61,
            45: 31,
            87: 23,
            92: 5,
            94: 3,
            102: 30})


# Time a single call to the scoring function
import time
start = time.time()
score = func.countWords("Slow down Sir, you're going to give yourself skin faliure!")
end = time.time()
print(end - start)
>> 0.0019948482513427734
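
As a rough sanity check on the figures quoted above, the single timed call can be scaled up to the full dataset. This is only a back-of-the-envelope sketch: it assumes the one measurement (about 2 ms) is representative of an average call, which a single sample cannot guarantee.

```python
# Figures taken from the question: row count, one timed call, reported runtime.
rows = 1_900_000
seconds_per_call = 0.0019948482513427734  # a single sample, so only indicative
reported_hours = 80

# Time spent inside the scoring function if every call cost about the same.
func_hours = rows * seconds_per_call / 3600

# Fraction of the reported runtime that the function calls would explain.
share = func_hours / reported_hours

print(f"about {func_hours:.1f} h in the function, {share:.1%} of the reported runtime")
```

Under that assumption the function calls account for roughly one hour, so most of the reported 80 hours would have to come from elsewhere in the loop, for example from growing the DataFrame one row at a time.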






  • Any appreciable speedup here will come from optimizing countWords - I'd advise that you post that function for review.
    – Zach
    Jul 31 at 16:56
















asked Jul 31 at 13:08 by F.D
edited Aug 1 at 4:10 by 200_success











1 Answer
Several errors:




I then loop through each of these loops




I take it you mean "I then loop through each of these columns".




for post in data['post]:




missing end quote mark




scores = Func.countWords(posts)




You imported func (lowercase) and now you're calling Func (uppercase)




data2 = data2.append(scores,ignore_index=True)




append should take a row-type object. If the function returns a numeric value, then you shouldn't be appending it. Instead you can do:



def post_to_count(post):
    # Convert to str first so non-string cells (e.g. NaN) do not break the call
    return func.countWords(str(post))

scores = data['post'].apply(post_to_count)
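
If the function returns a Counter per post, as the question's output suggests, one way to get a separate score table is to collect the counters in a list and build the DataFrame once at the end. The sketch below is illustrative only: `count_words` and `TRACKED` are hypothetical stand-ins for `func.countWords`, which is not shown, and the three-row frame replaces the real dataset. Calling `DataFrame.append` inside a loop copies the whole frame on every iteration (and the method was removed in pandas 2.0), so building from a list avoids that repeated copying.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for func.countWords: counts a few tracked words.
TRACKED = {"slow", "sir", "skin"}

def count_words(text):
    words = (token.strip(".,!?") for token in text.lower().split())
    return Counter(word for word in words if word in TRACKED)

# Tiny stand-in for the real dataset; None mimics a missing post.
data = pd.DataFrame({"post": ["Slow down Sir", "skin and more skin", None]})

# Collect one counter per row in a plain list (cheap appends).
counters = [count_words(str(post)) for post in data["post"]]

# Build the score table in a single step; words absent from a post become 0.
scores = (
    pd.DataFrame([dict(c) for c in counters], index=data.index)
    .fillna(0)
    .astype(int)
)
```

Here `scores` has one row per post and one column per word that occurred at least once, and it shares its index with `data`, so the two can be joined later if needed.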





  • I've fixed those spelling and syntax errors. Your .apply works fine, but 1) it returns a combined dataframe, and 2) does it speed things up?
    – F.D
    Jul 31 at 16:16










  • What about scores = data.post.astype(str).apply(func.countWords)?
    – Mathias Ettinger
    Jul 31 at 20:05










answered Jul 31 at 16:05 by Acccumulation










