Importing text into PANDAS and counting certain words

Aim: To improve the speed of the following code. The current run time is roughly 80 hours.

Purpose: The code imports a dataset of 1.9 million rows and two columns. One of these columns contains text posts of variable length. I loop through the rows and pass each post to an imported function that returns a counter of variable length. The counter tells me which of certain words are present in the text. On average the function takes less than 1 ms to return this counter (a timing of a single call is included at the end).

Overheads: The code I'm looking to improve is the loop. I accept a certain level of overhead from func, which can't be improved at the moment. I have considered using Spark or Dask to parallelize the loop and speed up the process. Suggestions are welcome.

#Import data
import pandas as pd 
from func import func 
data = pd.read_csv('Dataset.csv')

print(len(data))
>> 1900000

print(data.columns)
>> Index(['type', 'body'], dtype='object')


# Create new DF
data2 = pd.DataFrame()

# Score each post and append the resulting counter as a new row
for post in data['post']:
    post = str(post)
    scores = func.countWords(post)
    data2 = data2.append(scores, ignore_index=True)

print(scores)
>> Counter({0: 306,
            1: 185,
            2: 61,
            45: 31,
            87: 23,
            92: 5,
            94: 3,
            102: 30})


# Time a single call to the function
import time
start = time.time()
score = func.countWords("Slow down Sir, you're going to give yourself skin faliure!")
end = time.time()
print(end - start)
>> 0.0019948482513427734
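
A single time.time() reading is noisy, and the roughly 2 ms printed above is higher than the "less than 1 ms" estimate. A minimal sketch of averaging over many calls with time.perf_counter follows. The count_words function and the average_call_time helper are hypothetical stand-ins, since func.countWords is not shown in the question.

```python
import time
from collections import Counter


def count_words(text):
    # Hypothetical stand-in for func.countWords, which is not shown in the question.
    return Counter(text.lower().split())


def average_call_time(fn, arg, repeats=1000):
    """Return the mean wall-clock seconds per call of fn(arg)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats


sample = "Slow down Sir, you're going to give yourself skin faliure!"
per_call = average_call_time(count_words, sample)
print(f"{per_call * 1e3:.4f} ms per call")

# Rough estimate of the time spent inside the function alone for 1.9 million rows
print(f"{per_call * 1_900_000 / 3600:.2f} hours for 1.9M rows")
```

At 2 ms per call, 1.9 million calls come to about an hour, so a total far beyond that suggests most of the time is spent elsewhere in the loop rather than in the function.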

asked Jul 31 at 13:08 by F.D (edited Aug 1 at 4:10 by 200_success)

Any appreciable speedup here will come from optimizing countWords. I'd advise that you post that function for review.
- Zach, Jul 31 at 16:56

1 Answer

Several errors:

I then loop through each of these loops

I take it you mean "I then loop through each of these rows".

for post in data['post]:

The closing quote mark is missing.

scores = Func.countWords(posts)

You imported func (lowercase), but you are calling Func (uppercase).

data2 = data2.append(scores,ignore_index=True)

append expects a row-like object such as a Series or a dict. If the function returns a single number, you shouldn't be appending it. Instead you can do:

def post_to_count(post):
    # Convert to str first so non-string values (e.g. NaN) don't break the call
    return func.countWords(str(post))

scores = data['post'].apply(post_to_count)
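
One likely cause of the long run time, separate from the function itself, is that DataFrame.append returns a new copy of the whole frame on every iteration (the method was also removed in pandas 2.0). A minimal sketch of scoring every post first and building the wide frame once is shown here. The count_words function, its tiny vocabulary, and the three sample posts are hypothetical stand-ins, since func.countWords and the dataset are not shown.

```python
from collections import Counter

import pandas as pd


def count_words(text):
    # Hypothetical stand-in for func.countWords: maps a word id to its count.
    vocabulary = {"slow": 0, "down": 1, "sir": 2}
    return Counter(vocabulary[w] for w in text.lower().split() if w in vocabulary)


def post_to_count(post):
    # Same shape as the helper above: coerce to str, then score.
    return count_words(str(post))


data = pd.DataFrame({"post": ["Slow down Sir", "slow slow", "nothing relevant"]})

# Score every post; the result is a Series of Counter objects.
scores = data["post"].apply(post_to_count)

# Build the wide frame once: one row per post, one column per counter key.
# Keys missing from a post come out as NaN, so fill them with 0.
data2 = pd.DataFrame(scores.tolist(), index=data.index).fillna(0).astype(int)
print(data2)
```

This still calls the function once per row, so it does not make the function faster; it only avoids re-copying the growing frame 1.9 million times.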

answered Jul 31 at 16:05 by Acccumulation

I've fixed those spelling and syntax errors. Your .apply works fine, but 1) it returns a combined dataframe, and 2) does it actually speed things up?
- F.D, Jul 31 at 16:16

What about scores = data.post.str.apply(func.countWord)?
- Mathias Ettinger, Jul 31 at 20:05
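
As far as I know, the .str accessor has no apply method, so the call suggested in the last comment would raise an AttributeError; Series.map or Series.apply on the column itself is the usual spelling. A small sketch, with count_words as a hypothetical stand-in for func.countWords and two made-up posts:

```python
from collections import Counter

import pandas as pd


def count_words(text):
    # Hypothetical stand-in for func.countWords.
    return Counter(str(text).lower().split())


posts = pd.Series(["Slow down Sir", "slow down"], name="post")

# The string accessor only exposes vectorised string methods.
has_str_apply = hasattr(posts.str, "apply")
print(has_str_apply)

# Calling the function through the Series itself works.
scores = posts.map(count_words)
print(scores.iloc[1])
```

Either spelling still runs the Python function once per row, so it is unlikely to be much faster than a plain loop over the column.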
