Filter out non-alphabetic characters from a list of words

For coding practice / interview exercises, I'd like to know if there's an optimization I can make to the following, where I "clean" a given word to remove punctuation or other characters that are not within "a" to "z".

There are already some great answers here on removing punctuation from a string, so my question is not about the best way to do that. Instead, I'd like to know whether I can optimize the 3 lines of code in the word_count_engine function below. Can I do this in 1 or 2 lines, or make the code more efficient so that it doesn't loop over the list twice (i.e. with 2 list comprehensions)?

def clean(word):
    returnword = ""
    for letter in word.lower():
        if letter >= 'a' and letter <= 'z':
            # not out of bounds
            returnword += letter
    return returnword


def word_count_engine(document):

    words = document.split()  # if there are extra spaces, split() still filters empty words out FYI
    words = [clean(word) for word in words]  # a word like "$33!" will result in an empty string though
    words = [word for word in words if word]  # so filter out empty strings and get the final list of clean words

document = "Practice makes perfect. you'll only get Perfect by practice. just practice! $544 test"

asked Apr 4 at 20:42 by rishijd (edited Apr 4 at 20:55 by 200_success)

1 Answer (accepted)

Since Python strings are immutable, appending one character at a time using += is inefficient. You end up allocating a new string, copying all of the old string, then writing one character.

Instead, clean() should be written like this:

def clean(word):
    return ''.join(letter for letter in word.lower() if 'a' <= letter <= 'z')

Note that Python supports chained comparisons: 'a' <= letter <= 'z' is equivalent to 'a' <= letter and letter <= 'z', with letter evaluated only once.
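As a quick sanity check, the sketch below puts the loop-based version next to the join-based one and confirms they agree on a few sample words. The name clean_loop is hypothetical, used here only to keep the two versions apart; it is not from the original post.

```python
def clean_loop(word):
    # Original approach: build the result one character at a time with +=.
    result = ""
    for letter in word.lower():
        if letter >= 'a' and letter <= 'z':
            result += letter
    return result


def clean(word):
    # Join-based approach: filter the letters, then join them in one step.
    return ''.join(letter for letter in word.lower() if 'a' <= letter <= 'z')


samples = ["Practice", "you'll", "perfect.", "$544"]
for sample in samples:
    # Both versions should produce the same cleaned word.
    assert clean_loop(sample) == clean(sample)

print([clean(sample) for sample in samples])
# ['practice', 'youll', 'perfect', '']
```

Note that a word made up entirely of non-letters, such as "$544", cleans to an empty string, which is why the caller still has to filter out empty results.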

The name of your word_count_engine function poorly describes what it does. In fact, the function doesn't print or return anything, so it's all dead code. If I had to rewrite it, though, I'd say:

words = [word for word in map(clean, document.split()) if word]
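To make that one-liner usable, it can be wrapped in a function that actually returns the list. This is only a sketch: the name clean_words is an assumption, chosen to describe what the code does, and is not from the original post.

```python
def clean(word):
    # Keep only the characters 'a' to 'z' after lowercasing.
    return ''.join(letter for letter in word.lower() if 'a' <= letter <= 'z')


def clean_words(document):
    # Hypothetical helper name. Split on whitespace, clean each word lazily
    # via map(), and drop words that end up empty (e.g. "$544").
    return [word for word in map(clean, document.split()) if word]


document = "Practice makes perfect. you'll only get Perfect by practice. just practice! $544 test"
print(clean_words(document))
# ['practice', 'makes', 'perfect', 'youll', 'only', 'get', 'perfect', 'by',
#  'practice', 'just', 'practice', 'test']
```

Because map() is lazy in Python 3, the words are cleaned and filtered in a single pass, with no intermediate list of uncleaned or empty words.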

Also consider replacing all of this code with a simple regular expression substitution.
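One possible shape for the regular expression approach is sketched below, using re.sub from the standard library. The function names clean_re, clean_words_re and clean_words_single_pass are hypothetical, and the single-pass variant assumes that the whitespace matched by \s lines up with what str.split() splits on for your input.

```python
import re

# Matches any character that is not a lowercase ASCII letter.
NON_LETTERS = re.compile(r'[^a-z]')


def clean_re(word):
    # Lowercase first, then delete everything outside 'a' to 'z'.
    return NON_LETTERS.sub('', word.lower())


def clean_words_re(document):
    # Same structure as before, with the regex-based cleaner swapped in.
    return [word for word in map(clean_re, document.split()) if word]


def clean_words_single_pass(document):
    # Variant: strip unwanted characters from the whole document at once,
    # keeping whitespace so that split() can still separate the words.
    return re.sub(r'[^a-z\s]', '', document.lower()).split()


document = "Practice makes perfect. you'll only get Perfect by practice. just practice! $544 test"
assert clean_words_re(document) == clean_words_single_pass(document)
print(clean_words_re(document))
```

In the single-pass variant, a token like "$544" collapses to nothing between two spaces, and split() discards the resulting empty word without an explicit filter.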

answered Apr 4 at 21:03 by 200_success

Awesome, thanks! I learn more than I expect through Stack Exchange thanks to people like you! Re: name of function - sorry, it's because the function is actually for something more detailed than I have described, and the above lines are just the first few lines of the function. I should have made that clear in the question/renamed it. I'll practice with regex functions next.
– rishijd, Apr 5 at 0:39
