Remove duplicate characters in a string
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
4
down vote
favorite
This is a question from the book "Cracking the Coding Interview".
Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not
This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.
#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower
void removeDuplicates(std::string& str)
int flag = 0;
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
int main()
std::string str;
std::cout << "Enter Stringn";
std::getline(std::cin, str);
removeDuplicates(str);
std::cout << "New Stringn";
std::cout << str << 'n';
Help me optimize this code.
c++ programming-challenge interview-questions
add a comment |Â
up vote
4
down vote
favorite
This is a question from the book "Cracking the Coding Interview".
Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not
This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.
#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower
void removeDuplicates(std::string& str)
int flag = 0;
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
int main()
std::string str;
std::cout << "Enter Stringn";
std::getline(std::cin, str);
removeDuplicates(str);
std::cout << "New Stringn";
std::cout << str << 'n';
Help me optimize this code.
c++ programming-challenge interview-questions
It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
â Errorsatz
Jul 31 at 17:46
3
Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
â Martin York
Jul 31 at 17:49
3
If this were a real interview, my first question would be, "Do you mean remove doubled letters a launiq
, so thatllama
becomeslama
? Or remove all duplicates a lasort | uniq
, so thatllama
becomesalm
?" "And if the latter, is it okay to becomealm
or must it becomelam
?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
â Quuxplusone
Jul 31 at 19:01
add a comment |Â
up vote
4
down vote
favorite
up vote
4
down vote
favorite
This is a question from the book "Cracking the Coding Interview".
Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not
This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.
#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower
void removeDuplicates(std::string& str)
int flag = 0;
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
int main()
std::string str;
std::cout << "Enter Stringn";
std::getline(std::cin, str);
removeDuplicates(str);
std::cout << "New Stringn";
std::cout << str << 'n';
Help me optimize this code.
c++ programming-challenge interview-questions
This is a question from the book "Cracking the Coding Interview".
Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not
This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.
#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower
void removeDuplicates(std::string& str)
int flag = 0;
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
int main()
std::string str;
std::cout << "Enter Stringn";
std::getline(std::cin, str);
removeDuplicates(str);
std::cout << "New Stringn";
std::cout << str << 'n';
Help me optimize this code.
c++ programming-challenge interview-questions
asked Jul 31 at 17:29
coder
911425
911425
It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
â Errorsatz
Jul 31 at 17:46
3
Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
â Martin York
Jul 31 at 17:49
3
If this were a real interview, my first question would be, "Do you mean remove doubled letters a launiq
, so thatllama
becomeslama
? Or remove all duplicates a lasort | uniq
, so thatllama
becomesalm
?" "And if the latter, is it okay to becomealm
or must it becomelam
?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
â Quuxplusone
Jul 31 at 19:01
add a comment |Â
It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
â Errorsatz
Jul 31 at 17:46
3
Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
â Martin York
Jul 31 at 17:49
3
If this were a real interview, my first question would be, "Do you mean remove doubled letters a launiq
, so thatllama
becomeslama
? Or remove all duplicates a lasort | uniq
, so thatllama
becomesalm
?" "And if the latter, is it okay to becomealm
or must it becomelam
?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
â Quuxplusone
Jul 31 at 19:01
It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
â Errorsatz
Jul 31 at 17:46
It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
â Errorsatz
Jul 31 at 17:46
3
3
Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
â Martin York
Jul 31 at 17:49
Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
â Martin York
Jul 31 at 17:49
3
3
If this were a real interview, my first question would be, "Do you mean remove doubled letters a la
uniq
, so that llama
becomes lama
? Or remove all duplicates a la sort | uniq
, so that llama
becomes alm
?" "And if the latter, is it okay to become alm
or must it become lam
?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)â Quuxplusone
Jul 31 at 19:01
If this were a real interview, my first question would be, "Do you mean remove doubled letters a la
uniq
, so that llama
becomes lama
? Or remove all duplicates a la sort | uniq
, so that llama
becomes alm
?" "And if the latter, is it okay to become alm
or must it become lam
?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)â Quuxplusone
Jul 31 at 19:01
add a comment |Â
2 Answers
2
active
oldest
votes
up vote
11
down vote
accepted
Clarify the requirement
You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?
Sort out the bugs
I sense premature optimisation here, and it is the root of some evil:
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (str[i] == std::toupper(str[j])
What is the value of len
? The length of the original string. And what is the value of len
after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len
, or you could just call length()
- it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length()
until a profiler tells you otherwise.
Getting an idea of computational complexity
The complexity of the pair of loops is O(n2). But inside that, you do an erase
- imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).
There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c
in one go.
Without preserving order, we can do O(n log n)
The easy way to improve the complexity is just to sort the characters in the string:
std::sort(str.begin(), str.end());
And then run through removing adjacent duplicates.
Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?
Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.
If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.
Remove_if to the rescue
There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:
abcdeaafg_
^^
abcdeaafg_
^ ^
abcdefafg_
^ ^
abcdefgfg_
^ ^
abcdefg_
But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:
static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
std::bitset<256> seenChars;
auto newEnd = remove_if(str.begin(), str.end(),
[&seenChars](auto & c) // capture seenChars so we can use it
if(seenChars[c%256]) // have seen it before, so remove
return true;
seenChars[c%256] = true; // we have seen it now
return false; // but we won't remove this one
);
(Edited this after replacing array with bitset)
To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars]
captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.
The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.
remove_if
itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:
str.resize(std::distance(str.begin(), newEnd));
Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.
By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.
1
std::array<bool, 256> seenChars;
violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
â Snowhawk
Jul 31 at 21:55
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
â Phil H
Aug 1 at 6:00
std::array<bool, 256>
isn't a fixed 64 bytes and there is no proxy specialization ofstd::array
that bitpacks likestd::vector<bool>
. Assuming an implementation that definessizeof(bool) == 1
, the size of your container is 256 bytes, same asbool[256]
. A bit-packing structure, likestd::bitset
, is likely what you are looking for.
â Snowhawk
Aug 1 at 8:44
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
â Phil H
Aug 1 at 14:11
Hi, why don't you simply use astd::exchange
for yourseenChars
bitset in your lambda? That would lead to one single linereturn std::exchange(seenChars[c%256], true);
â DNK
Aug 3 at 6:44
 |Â
show 1 more comment
up vote
4
down vote
void removeDuplicates(std::string& str)
Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm>
library, we have std::remove
/std::remove_if
. Removing is described as follows:
Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.
Sounds exactly like what we're aiming for, but it doesn't actually call erase()
. If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase()
(see Erase-Remove Idiom).
std::string::iterator removeDuplicates(std::string& str) ...
int main()
auto s = std::string"aaaaaa";
s.erase(removeDuplicates(s), s.end());
std::cout << s; // Outputs "a"
int flag = 0;
C++ has a boolean type (bool
).
flag
is a bad name. Names are a nifty way to express intent or purpose. Since flag
is flagging if a duplicate exists, why not name it duplicate_exists
?
bool duplicate_exists = false;
int len = str.length();
for (int i = 0; i < len - 1; ++i) {
for (int j = i + 1; j < len - 1; ++j)
std::string::length()
returns a size of type std::string::size_type
(std::size_t
typically). You have a conversion to int
here (-Wconversion
).
Every time you call str.erase()
, you are reducing the physical length of str
by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:
removeDuplicates("a") -> No Duplicate, "a"
removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaaa") -> Segmentation Fault.
If you simply just str.length()
on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.
if (str[i] == std::toupper(str[j])
|| str[i] == std::tolower(str[j]))
"If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
11
down vote
accepted
Clarify the requirement
You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?
Sort out the bugs
I sense premature optimisation here, and it is the root of some evil:
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (str[i] == std::toupper(str[j])
What is the value of len
? The length of the original string. And what is the value of len
after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len
, or you could just call length()
- it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length()
until a profiler tells you otherwise.
Getting an idea of computational complexity
The complexity of the pair of loops is O(n2). But inside that, you do an erase
- imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).
There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c
in one go.
Without preserving order, we can do O(n log n)
The easy way to improve the complexity is just to sort the characters in the string:
std::sort(str.begin(), str.end());
And then run through removing adjacent duplicates.
Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?
Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.
If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.
Remove_if to the rescue
There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:
abcdeaafg_
^^
abcdeaafg_
^ ^
abcdefafg_
^ ^
abcdefgfg_
^ ^
abcdefg_
But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:
static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
std::bitset<256> seenChars;
auto newEnd = remove_if(str.begin(), str.end(),
[&seenChars](auto & c) // capture seenChars so we can use it
if(seenChars[c%256]) // have seen it before, so remove
return true;
seenChars[c%256] = true; // we have seen it now
return false; // but we won't remove this one
);
(Edited this after replacing array with bitset)
To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars]
captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.
The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.
remove_if
itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:
str.resize(std::distance(str.begin(), newEnd));
Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.
By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.
1
std::array<bool, 256> seenChars;
violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
â Snowhawk
Jul 31 at 21:55
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
â Phil H
Aug 1 at 6:00
std::array<bool, 256>
isn't a fixed 64 bytes and there is no proxy specialization ofstd::array
that bitpacks likestd::vector<bool>
. Assuming an implementation that definessizeof(bool) == 1
, the size of your container is 256 bytes, same asbool[256]
. A bit-packing structure, likestd::bitset
, is likely what you are looking for.
â Snowhawk
Aug 1 at 8:44
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
â Phil H
Aug 1 at 14:11
Hi, why don't you simply use astd::exchange
for yourseenChars
bitset in your lambda? That would lead to one single linereturn std::exchange(seenChars[c%256], true);
â DNK
Aug 3 at 6:44
 |Â
show 1 more comment
up vote
11
down vote
accepted
Clarify the requirement
You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?
Sort out the bugs
I sense premature optimisation here, and it is the root of some evil:
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (str[i] == std::toupper(str[j])
What is the value of len
? The length of the original string. And what is the value of len
after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len
, or you could just call length()
- it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length()
until a profiler tells you otherwise.
Getting an idea of computational complexity
The complexity of the pair of loops is O(n2). But inside that, you do an erase
- imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).
There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c
in one go.
Without preserving order, we can do O(n log n)
The easy way to improve the complexity is just to sort the characters in the string:
std::sort(str.begin(), str.end());
And then run through removing adjacent duplicates.
Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?
Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.
If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.
Remove_if to the rescue
There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:
abcdeaafg_
^^
abcdeaafg_
^ ^
abcdefafg_
^ ^
abcdefgfg_
^ ^
abcdefg_
But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:
static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
std::bitset<256> seenChars;
auto newEnd = remove_if(str.begin(), str.end(),
[&seenChars](auto & c) // capture seenChars so we can use it
if(seenChars[c%256]) // have seen it before, so remove
return true;
seenChars[c%256] = true; // we have seen it now
return false; // but we won't remove this one
);
(Edited this after replacing array with bitset)
To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars]
captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.
The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.
remove_if
itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:
str.resize(std::distance(str.begin(), newEnd));
Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.
By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.
1
std::array<bool, 256> seenChars;
violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
â Snowhawk
Jul 31 at 21:55
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
â Phil H
Aug 1 at 6:00
std::array<bool, 256>
isn't a fixed 64 bytes and there is no proxy specialization ofstd::array
that bitpacks likestd::vector<bool>
. Assuming an implementation that definessizeof(bool) == 1
, the size of your container is 256 bytes, same asbool[256]
. A bit-packing structure, likestd::bitset
, is likely what you are looking for.
â Snowhawk
Aug 1 at 8:44
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
â Phil H
Aug 1 at 14:11
Hi, why don't you simply use astd::exchange
for yourseenChars
bitset in your lambda? That would lead to one single linereturn std::exchange(seenChars[c%256], true);
â DNK
Aug 3 at 6:44
 |Â
show 1 more comment
up vote
11
down vote
accepted
up vote
11
down vote
accepted
Clarify the requirement
You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?
Sort out the bugs
I sense premature optimisation here, and it is the root of some evil:
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (str[i] == std::toupper(str[j])
What is the value of len
? The length of the original string. And what is the value of len
after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len
, or you could just call length()
- it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length()
until a profiler tells you otherwise.
Getting an idea of computational complexity
The complexity of the pair of loops is O(n2). But inside that, you do an erase
- imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).
There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c
in one go.
Without preserving order, we can do O(n log n)
The easy way to improve the complexity is just to sort the characters in the string:
std::sort(str.begin(), str.end());
And then run through removing adjacent duplicates.
Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?
Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.
If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.
Remove_if to the rescue
There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:
abcdeaafg_
^^
abcdeaafg_
^ ^
abcdefafg_
^ ^
abcdefgfg_
^ ^
abcdefg_
But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:
static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
std::bitset<256> seenChars;
auto newEnd = remove_if(str.begin(), str.end(),
[&seenChars](auto & c) // capture seenChars so we can use it
if(seenChars[c%256]) // have seen it before, so remove
return true;
seenChars[c%256] = true; // we have seen it now
return false; // but we won't remove this one
);
(Edited this after replacing array with bitset)
To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars]
captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.
The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.
remove_if
itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:
str.resize(std::distance(str.begin(), newEnd));
Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.
By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.
Clarify the requirement
You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?
Sort out the bugs
I sense premature optimisation here, and it is the root of some evil:
int len = str.length();
for (int i = 0; i < len - 1; ++i)
for (int j = i + 1; j < len - 1; ++j)
if (str[i] == std::toupper(str[j])
What is the value of len
? The length of the original string. And what is the value of len
after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len
, or you could just call length()
- it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length()
until a profiler tells you otherwise.
Getting an idea of computational complexity
The complexity of the pair of loops is O(n2). But inside that, you do an erase
- imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).
There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c
in one go.
Without preserving order, we can do O(n log n)
The easy way to improve the complexity is just to sort the characters in the string:
std::sort(str.begin(), str.end());
And then run through removing adjacent duplicates.
Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?
Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.
If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.
Remove_if to the rescue
There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:
abcdeaafg_
^^
abcdeaafg_
^ ^
abcdefafg_
^ ^
abcdefgfg_
^ ^
abcdefg_
But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:
static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
std::bitset<256> seenChars;
auto newEnd = remove_if(str.begin(), str.end(),
[&seenChars](auto & c) // capture seenChars so we can use it
if(seenChars[c%256]) // have seen it before, so remove
return true;
seenChars[c%256] = true; // we have seen it now
return false; // but we won't remove this one
);
(Edited this after replacing array with bitset)
To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars]
captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.
The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.
remove_if
itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:
str.resize(std::distance(str.begin(), newEnd));
Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.
By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.
edited Aug 1 at 16:12
answered Jul 31 at 21:04
Phil H
38647
38647
1
std::array<bool, 256> seenChars;
violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
â Snowhawk
Jul 31 at 21:55
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
â Phil H
Aug 1 at 6:00
std::array<bool, 256>
isn't a fixed 64 bytes and there is no proxy specialization ofstd::array
that bitpacks likestd::vector<bool>
. Assuming an implementation that definessizeof(bool) == 1
, the size of your container is 256 bytes, same asbool[256]
. A bit-packing structure, likestd::bitset
, is likely what you are looking for.
â Snowhawk
Aug 1 at 8:44
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
â Phil H
Aug 1 at 14:11
Hi, why don't you simply use astd::exchange
for yourseenChars
bitset in your lambda? That would lead to one single linereturn std::exchange(seenChars[c%256], true);
â DNK
Aug 3 at 6:44
 |Â
show 1 more comment
1
std::array<bool, 256> seenChars;
violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
â Snowhawk
Jul 31 at 21:55
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
â Phil H
Aug 1 at 6:00
std::array<bool, 256>
isn't a fixed 64 bytes and there is no proxy specialization ofstd::array
that bitpacks likestd::vector<bool>
. Assuming an implementation that definessizeof(bool) == 1
, the size of your container is 256 bytes, same asbool[256]
. A bit-packing structure, likestd::bitset
, is likely what you are looking for.
â Snowhawk
Aug 1 at 8:44
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
â Phil H
Aug 1 at 14:11
Hi, why don't you simply use astd::exchange
for yourseenChars
bitset in your lambda? That would lead to one single linereturn std::exchange(seenChars[c%256], true);
â DNK
Aug 3 at 6:44
1
1
std::array<bool, 256> seenChars;
violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behaviorâ Snowhawk
Jul 31 at 21:55
std::array<bool, 256> seenChars;
violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behaviorâ Snowhawk
Jul 31 at 21:55
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
â Phil H
Aug 1 at 6:00
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
â Phil H
Aug 1 at 6:00
std::array<bool, 256>
isn't a fixed 64 bytes and there is no proxy specialization of std::array
that bitpacks like std::vector<bool>
. Assuming an implementation that defines sizeof(bool) == 1
, the size of your container is 256 bytes, same as bool[256]
. A bit-packing structure, like std::bitset
, is likely what you are looking for.â Snowhawk
Aug 1 at 8:44
std::array<bool, 256>
isn't a fixed 64 bytes and there is no proxy specialization of std::array
that bitpacks like std::vector<bool>
. Assuming an implementation that defines sizeof(bool) == 1
, the size of your container is 256 bytes, same as bool[256]
. A bit-packing structure, like std::bitset
, is likely what you are looking for.â Snowhawk
Aug 1 at 8:44
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
â Phil H
Aug 1 at 14:11
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
â Phil H
Aug 1 at 14:11
Hi, why don't you simply use a
std::exchange
for your seenChars
bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
â DNK
Aug 3 at 6:44
Hi, why don't you simply use a
std::exchange
for your seenChars
bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
â DNK
Aug 3 at 6:44
 |Â
show 1 more comment
up vote
4
down vote
void removeDuplicates(std::string& str)
Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm>
library, we have std::remove
/std::remove_if
. Removing is described as follows:
Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.
Sounds exactly like what we're aiming for, but it doesn't actually call erase()
. If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase()
(see Erase-Remove Idiom).
std::string::iterator removeDuplicates(std::string& str) ...
int main()
auto s = std::string"aaaaaa";
s.erase(removeDuplicates(s), s.end());
std::cout << s; // Outputs "a"
int flag = 0;
C++ has a boolean type (bool
).
flag
is a bad name. Names are a nifty way to express intent or purpose. Since flag
is flagging if a duplicate exists, why not name it duplicate_exists
?
bool duplicate_exists = false;
int len = str.length();
for (int i = 0; i < len - 1; ++i) {
for (int j = i + 1; j < len - 1; ++j)
std::string::length()
returns a size of type std::string::size_type
(std::size_t
typically). You have a conversion to int
here (-Wconversion
).
Every time you call str.erase()
, you are reducing the physical length of str
by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:
removeDuplicates("a") -> No Duplicate, "a"
removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaaa") -> Segmentation Fault.
If you simply just str.length()
on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.
if (str[i] == std::toupper(str[j])
|| str[i] == std::tolower(str[j]))
"If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.
add a comment |Â
up vote
4
down vote
void removeDuplicates(std::string& str)
Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm>
library, we have std::remove
/std::remove_if
. Removing is described as follows:
Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.
Sounds exactly like what we're aiming for, but it doesn't actually call erase()
. If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase()
(see Erase-Remove Idiom).
std::string::iterator removeDuplicates(std::string& str) ...
int main()
auto s = std::string"aaaaaa";
s.erase(removeDuplicates(s), s.end());
std::cout << s; // Outputs "a"
int flag = 0;
C++ has a boolean type (bool
).
flag
is a bad name. Names are a nifty way to express intent or purpose. Since flag
is flagging if a duplicate exists, why not name it duplicate_exists
?
bool duplicate_exists = false;
int len = str.length();
for (int i = 0; i < len - 1; ++i) {
for (int j = i + 1; j < len - 1; ++j)
std::string::length()
returns a size of type std::string::size_type
(std::size_t
typically). You have a conversion to int
here (-Wconversion
).
Every time you call str.erase()
, you are reducing the physical length of str
by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:
removeDuplicates("a") -> No Duplicate, "a"
removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaaa") -> Segmentation Fault.
If you simply just str.length()
on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.
if (str[i] == std::toupper(str[j])
|| str[i] == std::tolower(str[j]))
"If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.
add a comment |Â
up vote
4
down vote
up vote
4
down vote
void removeDuplicates(std::string& str)
Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm>
library, we have std::remove
/std::remove_if
. Removing is described as follows:
Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.
Sounds exactly like what we're aiming for, but it doesn't actually call erase()
. If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase()
(see Erase-Remove Idiom).
std::string::iterator removeDuplicates(std::string& str) ...
int main()
auto s = std::string"aaaaaa";
s.erase(removeDuplicates(s), s.end());
std::cout << s; // Outputs "a"
int flag = 0;
C++ has a boolean type (bool
).
flag
is a bad name. Names are a nifty way to express intent or purpose. Since flag
is flagging if a duplicate exists, why not name it duplicate_exists
?
bool duplicate_exists = false;
int len = str.length();
for (int i = 0; i < len - 1; ++i) {
for (int j = i + 1; j < len - 1; ++j)
std::string::length()
returns a size of type std::string::size_type
(std::size_t
typically). You have a conversion to int
here (-Wconversion
).
Every time you call str.erase()
, you are reducing the physical length of str
by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:
removeDuplicates("a") -> No Duplicate, "a"
removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaaa") -> Segmentation Fault.
If you simply just str.length()
on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.
if (str[i] == std::toupper(str[j])
|| str[i] == std::tolower(str[j]))
"If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.
void removeDuplicates(std::string& str)
Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm>
library, we have std::remove
/std::remove_if
. Removing is described as follows:
Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.
Sounds exactly like what we're aiming for, but it doesn't actually call erase()
. If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase()
(see Erase-Remove Idiom).
std::string::iterator removeDuplicates(std::string& str) ...
int main()
auto s = std::string"aaaaaa";
s.erase(removeDuplicates(s), s.end());
std::cout << s; // Outputs "a"
int flag = 0;
C++ has a boolean type (bool
).
flag
is a bad name. Names are a nifty way to express intent or purpose. Since flag
is flagging if a duplicate exists, why not name it duplicate_exists
?
bool duplicate_exists = false;
int len = str.length();
for (int i = 0; i < len - 1; ++i) {
for (int j = i + 1; j < len - 1; ++j)
std::string::length()
returns a size of type std::string::size_type
(std::size_t
typically). You have a conversion to int
here (-Wconversion
).
Every time you call str.erase()
, you are reducing the physical length of str
by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:
removeDuplicates("a") -> No Duplicate, "a"
removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaaa") -> Segmentation Fault.
If you simply just str.length()
on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.
if (str[i] == std::toupper(str[j])
|| str[i] == std::tolower(str[j]))
"If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.
if (flag == 0)
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";
Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.
edited Aug 1 at 8:41
answered Aug 1 at 7:58
Snowhawk
3,9401925
3,9401925
add a comment |Â
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f200682%2fremove-duplicate-characters-in-a-string%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
â Errorsatz
Jul 31 at 17:46
3
Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
â Martin York
Jul 31 at 17:49
3
If this were a real interview, my first question would be, "Do you mean remove doubled letters a la
uniq
, so thatllama
becomeslama
? Or remove all duplicates a lasort | uniq
, so thatllama
becomesalm
?" "And if the latter, is it okay to becomealm
or must it becomelam
?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)â Quuxplusone
Jul 31 at 19:01