trjhtr

Question

This is a question from the book "Cracking the Coding Interview".

Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not

This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.

#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower

void removeDuplicates(std::string& str)

 int flag = 0;
 int len = str.length();
 for (int i = 0; i < len - 1; ++i)
 
 for (int j = i + 1; j < len - 1; ++j)
 
 
 if (flag == 0)
 
 std::cout << "There are no duplicate charactersn";
 
 else
 
 std::cout << "There are duplicate charactersn";
 


int main()

 std::string str;
 std::cout << "Enter Stringn";
 std::getline(std::cin, str);
 removeDuplicates(str);
 std::cout << "New Stringn";
 std::cout << str << 'n';

Help me optimize this code.

It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2). — Jul 31 at 17:46
Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it. — Jul 31 at 17:49
If this were a real interview, my first question would be, "Do you mean remove doubled letters a la uniq, so that llama becomes lama? Or remove all duplicates a la sort | uniq, so that llama becomes alm?" "And if the latter, is it okay to become alm or must it become lam?" As an interviewer, I'd rather receive that kind of question than receive any answer. :) — Jul 31 at 19:01

score 11 · Accepted Answer · 2018-08-01 16:12:49Z

Clarify the requirement

You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?

Sort out the bugs

I sense premature optimisation here, and it is the root of some evil:

int len = str.length();
for (int i = 0; i < len - 1; ++i)

 for (int j = i + 1; j < len - 1; ++j)
 
 if (str[i] == std::toupper(str[j])

What is the value of len? The length of the original string. And what is the value of len after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len, or you could just call length() - it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length() until a profiler tells you otherwise.

Getting an idea of computational complexity

The complexity of the pair of loops is O(n²). But inside that, you do an erase - imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n³).

There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c in one go.

Without preserving order, we can do O(n log n)

The easy way to improve the complexity is just to sort the characters in the string:

std::sort(str.begin(), str.end());

And then run through removing adjacent duplicates.

Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?

Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.

If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n²) land again, copying the last character in the string into successive locations.

Remove_if to the rescue

There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:

abcdeaafg_
 ^^
abcdeaafg_
 ^ ^
abcdefafg_
 ^ ^
abcdefgfg_
 ^ ^
abcdefg_

But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:

static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
std::bitset<256> seenChars;
auto newEnd = remove_if(str.begin(), str.end(),
 [&seenChars](auto & c) // capture seenChars so we can use it
 if(seenChars[c%256]) // have seen it before, so remove
 return true;
 seenChars[c%256] = true; // we have seen it now
 return false; // but we won't remove this one
 );

(Edited this after replacing array with bitset)

To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars] captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.

The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.

remove_if itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:

str.resize(std::distance(str.begin(), newEnd));

Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.

By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.

std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior — Jul 31 at 21:55
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue. — Aug 1 at 6:00
std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for. — Aug 1 at 8:44
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n]. — Aug 1 at 14:11
Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true); — Aug 3 at 6:44

score 4 · Answer 2 · 2018-08-01 08:41:31Z

void removeDuplicates(std::string& str)

Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm> library, we have std::remove/std::remove_if. Removing is described as follows:

Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.

Sounds exactly like what we're aiming for, but it doesn't actually call erase(). If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase() (see Erase-Remove Idiom).

std::string::iterator removeDuplicates(std::string& str) ... 

int main() 
 auto s = std::string"aaaaaa";
 s.erase(removeDuplicates(s), s.end());
 std::cout << s; // Outputs "a"

 int flag = 0;

C++ has a boolean type (bool).

flag is a bad name. Names are a nifty way to express intent or purpose. Since flag is flagging if a duplicate exists, why not name it duplicate_exists?

 bool duplicate_exists = false;

 int len = str.length();
 for (int i = 0; i < len - 1; ++i) {
 for (int j = i + 1; j < len - 1; ++j)

std::string::length() returns a size of type std::string::size_type (std::size_t typically). You have a conversion to int here (-Wconversion).

Every time you call str.erase(), you are reducing the physical length of str by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:

removeDuplicates("a") -> No Duplicate, "a"
removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaaa") -> Segmentation Fault.

If you simply just str.length() on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.

 if (str[i] == std::toupper(str[j])
 || str[i] == std::tolower(str[j]))

"If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.

 if (flag == 0) 
 std::cout << "There are no duplicate charactersn";
 else 
 std::cout << "There are duplicate charactersn";

Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.

score 11 · Accepted Answer · 2018-08-01 16:12:49Z

Clarify the requirement

You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?

Sort out the bugs

I sense premature optimisation here, and it is the root of some evil:

int len = str.length();
for (int i = 0; i < len - 1; ++i)

 for (int j = i + 1; j < len - 1; ++j)
 
 if (str[i] == std::toupper(str[j])

What is the value of len? The length of the original string. And what is the value of len after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len, or you could just call length() - it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length() until a profiler tells you otherwise.

Getting an idea of computational complexity

The complexity of the pair of loops is O(n²). But inside that, you do an erase - imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n³).

There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c in one go.

Without preserving order, we can do O(n log n)

The easy way to improve the complexity is just to sort the characters in the string:

std::sort(str.begin(), str.end());

And then run through removing adjacent duplicates.

Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?

Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.

If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n²) land again, copying the last character in the string into successive locations.

Remove_if to the rescue

There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:

abcdeaafg_
 ^^
abcdeaafg_
 ^ ^
abcdefafg_
 ^ ^
abcdefgfg_
 ^ ^
abcdefg_

But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:

static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
std::bitset<256> seenChars;
auto newEnd = remove_if(str.begin(), str.end(),
 [&seenChars](auto & c) // capture seenChars so we can use it
 if(seenChars[c%256]) // have seen it before, so remove
 return true;
 seenChars[c%256] = true; // we have seen it now
 return false; // but we won't remove this one
 );

(Edited this after replacing array with bitset)

To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars] captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.

The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.

remove_if itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:

str.resize(std::distance(str.begin(), newEnd));

Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.

By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.

std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior — Jul 31 at 21:55
@Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue. — Aug 1 at 6:00
std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for. — Aug 1 at 8:44
@Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n]. — Aug 1 at 14:11
Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true); — Aug 3 at 6:44

score 4 · Answer 4 · 2018-08-01 08:41:31Z

void removeDuplicates(std::string& str)

Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm> library, we have std::remove/std::remove_if. Removing is described as follows:

Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.

Sounds exactly like what we're aiming for, but it doesn't actually call erase(). If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase() (see Erase-Remove Idiom).

std::string::iterator removeDuplicates(std::string& str) ... 

int main() 
 auto s = std::string"aaaaaa";
 s.erase(removeDuplicates(s), s.end());
 std::cout << s; // Outputs "a"

 int flag = 0;

C++ has a boolean type (bool).

flag is a bad name. Names are a nifty way to express intent or purpose. Since flag is flagging if a duplicate exists, why not name it duplicate_exists?

 bool duplicate_exists = false;

 int len = str.length();
 for (int i = 0; i < len - 1; ++i) {
 for (int j = i + 1; j < len - 1; ++j)

std::string::length() returns a size of type std::string::size_type (std::size_t typically). You have a conversion to int here (-Wconversion).

Every time you call str.erase(), you are reducing the physical length of str by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:

removeDuplicates("a") -> No Duplicate, "a"
removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaaa") -> Segmentation Fault.

If you simply just str.length() on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.

 if (str[i] == std::toupper(str[j])
 || str[i] == std::tolower(str[j]))

"If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.

 if (flag == 0) 
 std::cout << "There are no duplicate charactersn";
 else 
 std::cout << "There are duplicate charactersn";

Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.

Remove duplicate characters in a string

2 Answers 2

Clarify the requirement

Sort out the bugs

Getting an idea of computational complexity

Without preserving order, we can do O(n log n)

Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?

Remove_if to the rescue

Your Answer

Sign up or log in

Post as a guest

Post as a guest

2 Answers 2

2 Answers 2

Clarify the requirement

Sort out the bugs

Getting an idea of computational complexity

Without preserving order, we can do O(n log n)

Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?

Remove_if to the rescue

Clarify the requirement

Sort out the bugs

Getting an idea of computational complexity

Without preserving order, we can do O(n log n)

Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?

Remove_if to the rescue

Clarify the requirement

Sort out the bugs

Getting an idea of computational complexity

Without preserving order, we can do O(n log n)

Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?

Remove_if to the rescue

Clarify the requirement

Sort out the bugs

Getting an idea of computational complexity

Without preserving order, we can do O(n log n)

Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?

Remove_if to the rescue

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Chat program with C++ and SFML

Read an image with ADNS2610 optical sensor and Arduino Uno

Read files from a directory using Promises

2 Answers
2

2 Answers
2

2 Answers
2