Remove duplicate characters in a string

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
4
down vote

favorite












This is a question from the book "Cracking the Coding Interview".




Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not




This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.



#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower

void removeDuplicates(std::string& str)

int flag = 0;
int len = str.length();
for (int i = 0; i < len - 1; ++i)

for (int j = i + 1; j < len - 1; ++j)


if (flag == 0)

std::cout << "There are no duplicate charactersn";

else

std::cout << "There are duplicate charactersn";



int main()

std::string str;
std::cout << "Enter Stringn";
std::getline(std::cin, str);
removeDuplicates(str);
std::cout << "New Stringn";
std::cout << str << 'n';



Help me optimize this code.







share|improve this question



















  • It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
    – Errorsatz
    Jul 31 at 17:46






  • 3




    Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
    – Martin York
    Jul 31 at 17:49






  • 3




    If this were a real interview, my first question would be, "Do you mean remove doubled letters a la uniq, so that llama becomes lama? Or remove all duplicates a la sort | uniq, so that llama becomes alm?" "And if the latter, is it okay to become alm or must it become lam?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
    – Quuxplusone
    Jul 31 at 19:01
















up vote
4
down vote

favorite












This is a question from the book "Cracking the Coding Interview".




Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not




This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.



#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower

void removeDuplicates(std::string& str)

int flag = 0;
int len = str.length();
for (int i = 0; i < len - 1; ++i)

for (int j = i + 1; j < len - 1; ++j)


if (flag == 0)

std::cout << "There are no duplicate charactersn";

else

std::cout << "There are duplicate charactersn";



int main()

std::string str;
std::cout << "Enter Stringn";
std::getline(std::cin, str);
removeDuplicates(str);
std::cout << "New Stringn";
std::cout << str << 'n';



Help me optimize this code.







share|improve this question



















  • It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
    – Errorsatz
    Jul 31 at 17:46






  • 3




    Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
    – Martin York
    Jul 31 at 17:49






  • 3




    If this were a real interview, my first question would be, "Do you mean remove doubled letters a la uniq, so that llama becomes lama? Or remove all duplicates a la sort | uniq, so that llama becomes alm?" "And if the latter, is it okay to become alm or must it become lam?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
    – Quuxplusone
    Jul 31 at 19:01












up vote
4
down vote

favorite









up vote
4
down vote

favorite











This is a question from the book "Cracking the Coding Interview".




Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not




This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.



#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower

void removeDuplicates(std::string& str)

int flag = 0;
int len = str.length();
for (int i = 0; i < len - 1; ++i)

for (int j = i + 1; j < len - 1; ++j)


if (flag == 0)

std::cout << "There are no duplicate charactersn";

else

std::cout << "There are duplicate charactersn";



int main()

std::string str;
std::cout << "Enter Stringn";
std::getline(std::cin, str);
removeDuplicates(str);
std::cout << "New Stringn";
std::cout << str << 'n';



Help me optimize this code.







share|improve this question











This is a question from the book "Cracking the Coding Interview".




Design an algorithm and write code to remove the duplicate characters
in a string without using any additional buffer NOTE: One or two
additional variables are fine An extra copy of the array is not




This is not the efficient code, but I want to preserve the order of characters after deleting duplicate characters.



#include <iostream>
#include <string>
#include <cctype> //std::toupper, std::tolower

void removeDuplicates(std::string& str)

int flag = 0;
int len = str.length();
for (int i = 0; i < len - 1; ++i)

for (int j = i + 1; j < len - 1; ++j)


if (flag == 0)

std::cout << "There are no duplicate charactersn";

else

std::cout << "There are duplicate charactersn";



int main()

std::string str;
std::cout << "Enter Stringn";
std::getline(std::cin, str);
removeDuplicates(str);
std::cout << "New Stringn";
std::cout << str << 'n';



Help me optimize this code.









share|improve this question










share|improve this question




share|improve this question









asked Jul 31 at 17:29









coder

911425




911425











  • It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
    – Errorsatz
    Jul 31 at 17:46






  • 3




    Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
    – Martin York
    Jul 31 at 17:49






  • 3




    If this were a real interview, my first question would be, "Do you mean remove doubled letters a la uniq, so that llama becomes lama? Or remove all duplicates a la sort | uniq, so that llama becomes alm?" "And if the latter, is it okay to become alm or must it become lam?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
    – Quuxplusone
    Jul 31 at 19:01
















  • It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
    – Errorsatz
    Jul 31 at 17:46






  • 3




    Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
    – Martin York
    Jul 31 at 17:49






  • 3




    If this were a real interview, my first question would be, "Do you mean remove doubled letters a la uniq, so that llama becomes lama? Or remove all duplicates a la sort | uniq, so that llama becomes alm?" "And if the latter, is it okay to become alm or must it become lam?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
    – Quuxplusone
    Jul 31 at 19:01















It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
– Errorsatz
Jul 31 at 17:46




It would only have a significant effect if a number of characters are repeated 3+ times, but instead of doing string.erase for each character (potentially O(n^3)), you could manually shift the characters over, skipping all the matches in one pass, and use pop_back at the end to reduce the size, which would be O(n^2).
– Errorsatz
Jul 31 at 17:46




3




3




Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
– Martin York
Jul 31 at 17:49




Have a bool array of 255 would make this really efficient (and probably not break the rules). As you discover a letter mark it true in the array. If you see it again you can remove it.
– Martin York
Jul 31 at 17:49




3




3




If this were a real interview, my first question would be, "Do you mean remove doubled letters a la uniq, so that llama becomes lama? Or remove all duplicates a la sort | uniq, so that llama becomes alm?" "And if the latter, is it okay to become alm or must it become lam?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
– Quuxplusone
Jul 31 at 19:01




If this were a real interview, my first question would be, "Do you mean remove doubled letters a la uniq, so that llama becomes lama? Or remove all duplicates a la sort | uniq, so that llama becomes alm?" "And if the latter, is it okay to become alm or must it become lam?" As an interviewer, I'd rather receive that kind of question than receive any answer. :)
– Quuxplusone
Jul 31 at 19:01










2 Answers
2






active

oldest

votes

















up vote
11
down vote



accepted










Clarify the requirement



You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?



Sort out the bugs



I sense premature optimisation here, and it is the root of some evil:



int len = str.length();
for (int i = 0; i < len - 1; ++i)

for (int j = i + 1; j < len - 1; ++j)

if (str[i] == std::toupper(str[j])




What is the value of len? The length of the original string. And what is the value of len after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len, or you could just call length() - it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length() until a profiler tells you otherwise.



Getting an idea of computational complexity



The complexity of the pair of loops is O(n2). But inside that, you do an erase - imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).



There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c in one go.



Without preserving order, we can do O(n log n)



The easy way to improve the complexity is just to sort the characters in the string:



std::sort(str.begin(), str.end());


And then run through removing adjacent duplicates.



Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?



Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.



If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.



Remove_if to the rescue



There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:



abcdeaafg_
^^
abcdeaafg_
^ ^
abcdefafg_
^ ^
abcdefgfg_
^ ^
abcdefg_


But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:



static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
std::bitset<256> seenChars;
auto newEnd = remove_if(str.begin(), str.end(),
[&seenChars](auto & c) // capture seenChars so we can use it
if(seenChars[c%256]) // have seen it before, so remove
return true;
seenChars[c%256] = true; // we have seen it now
return false; // but we won't remove this one
);


(Edited this after replacing array with bitset)



To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars] captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.



The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.



remove_if itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:



str.resize(std::distance(str.begin(), newEnd));


Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.



By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.






share|improve this answer



















  • 1




    std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
    – Snowhawk
    Jul 31 at 21:55










  • @Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
    – Phil H
    Aug 1 at 6:00










  • std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for.
    – Snowhawk
    Aug 1 at 8:44










  • @Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
    – Phil H
    Aug 1 at 14:11










  • Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
    – DNK
    Aug 3 at 6:44

















up vote
4
down vote













void removeDuplicates(std::string& str)


Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm> library, we have std::remove/std::remove_if. Removing is described as follows:




Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.




Sounds exactly like what we're aiming for, but it doesn't actually call erase(). If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase() (see Erase-Remove Idiom).



std::string::iterator removeDuplicates(std::string& str) ... 

int main()
auto s = std::string"aaaaaa";
s.erase(removeDuplicates(s), s.end());
std::cout << s; // Outputs "a"




 int flag = 0;


C++ has a boolean type (bool).



flag is a bad name. Names are a nifty way to express intent or purpose. Since flag is flagging if a duplicate exists, why not name it duplicate_exists?



 bool duplicate_exists = false;



 int len = str.length();
for (int i = 0; i < len - 1; ++i) {
for (int j = i + 1; j < len - 1; ++j)


std::string::length() returns a size of type std::string::size_type (std::size_t typically). You have a conversion to int here (-Wconversion).



Every time you call str.erase(), you are reducing the physical length of str by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:



removeDuplicates("a") -> No Duplicate, "a"
removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
removeDuplicates("aaaaa") -> Segmentation Fault.


If you simply just str.length() on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.




 if (str[i] == std::toupper(str[j])
|| str[i] == std::tolower(str[j]))


"If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.




 if (flag == 0) 
std::cout << "There are no duplicate charactersn";
else
std::cout << "There are duplicate charactersn";



Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.






share|improve this answer























    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "196"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );








     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f200682%2fremove-duplicate-characters-in-a-string%23new-answer', 'question_page');

    );

    Post as a guest






























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    11
    down vote



    accepted










    Clarify the requirement



    You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?



    Sort out the bugs



    I sense premature optimisation here, and it is the root of some evil:



    int len = str.length();
    for (int i = 0; i < len - 1; ++i)

    for (int j = i + 1; j < len - 1; ++j)

    if (str[i] == std::toupper(str[j])




    What is the value of len? The length of the original string. And what is the value of len after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len, or you could just call length() - it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length() until a profiler tells you otherwise.



    Getting an idea of computational complexity



    The complexity of the pair of loops is O(n2). But inside that, you do an erase - imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).



    There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c in one go.



    Without preserving order, we can do O(n log n)



    The easy way to improve the complexity is just to sort the characters in the string:



    std::sort(str.begin(), str.end());


    And then run through removing adjacent duplicates.



    Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?



    Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.



    If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.



    Remove_if to the rescue



    There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:



    abcdeaafg_
    ^^
    abcdeaafg_
    ^ ^
    abcdefafg_
    ^ ^
    abcdefgfg_
    ^ ^
    abcdefg_


    But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:



    static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
    std::bitset<256> seenChars;
    auto newEnd = remove_if(str.begin(), str.end(),
    [&seenChars](auto & c) // capture seenChars so we can use it
    if(seenChars[c%256]) // have seen it before, so remove
    return true;
    seenChars[c%256] = true; // we have seen it now
    return false; // but we won't remove this one
    );


    (Edited this after replacing array with bitset)



    To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars] captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.



    The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.



    remove_if itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:



    str.resize(std::distance(str.begin(), newEnd));


    Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.



    By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.






    share|improve this answer



















    • 1




      std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
      – Snowhawk
      Jul 31 at 21:55










    • @Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
      – Phil H
      Aug 1 at 6:00










    • std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for.
      – Snowhawk
      Aug 1 at 8:44










    • @Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
      – Phil H
      Aug 1 at 14:11










    • Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
      – DNK
      Aug 3 at 6:44














    up vote
    11
    down vote



    accepted










    Clarify the requirement



    You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?



    Sort out the bugs



    I sense premature optimisation here, and it is the root of some evil:



    int len = str.length();
    for (int i = 0; i < len - 1; ++i)

    for (int j = i + 1; j < len - 1; ++j)

    if (str[i] == std::toupper(str[j])




    What is the value of len? The length of the original string. And what is the value of len after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len, or you could just call length() - it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length() until a profiler tells you otherwise.



    Getting an idea of computational complexity



    The complexity of the pair of loops is O(n2). But inside that, you do an erase - imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).



    There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c in one go.



    Without preserving order, we can do O(n log n)



    The easy way to improve the complexity is just to sort the characters in the string:



    std::sort(str.begin(), str.end());


    And then run through removing adjacent duplicates.



    Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?



    Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.



    If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.



    Remove_if to the rescue



    There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:



    abcdeaafg_
    ^^
    abcdeaafg_
    ^ ^
    abcdefafg_
    ^ ^
    abcdefgfg_
    ^ ^
    abcdefg_


    But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:



    static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
    std::bitset<256> seenChars;
    auto newEnd = remove_if(str.begin(), str.end(),
    [&seenChars](auto & c) // capture seenChars so we can use it
    if(seenChars[c%256]) // have seen it before, so remove
    return true;
    seenChars[c%256] = true; // we have seen it now
    return false; // but we won't remove this one
    );


    (Edited this after replacing array with bitset)



    To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars] captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.



    The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.



    remove_if itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:



    str.resize(std::distance(str.begin(), newEnd));


    Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.



    By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.






    share|improve this answer



















    • 1




      std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
      – Snowhawk
      Jul 31 at 21:55










    • @Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
      – Phil H
      Aug 1 at 6:00










    • std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for.
      – Snowhawk
      Aug 1 at 8:44










    • @Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
      – Phil H
      Aug 1 at 14:11










    • Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
      – DNK
      Aug 3 at 6:44












    up vote
    11
    down vote



    accepted







    up vote
    11
    down vote



    accepted






    Clarify the requirement



    You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?



    Sort out the bugs



    I sense premature optimisation here, and it is the root of some evil:



    int len = str.length();
    for (int i = 0; i < len - 1; ++i)

    for (int j = i + 1; j < len - 1; ++j)

    if (str[i] == std::toupper(str[j])




    What is the value of len? The length of the original string. And what is the value of len after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len, or you could just call length() - it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length() until a profiler tells you otherwise.



    Getting an idea of computational complexity



    The complexity of the pair of loops is O(n2). But inside that, you do an erase - imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).



    There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c in one go.



    Without preserving order, we can do O(n log n)



    The easy way to improve the complexity is just to sort the characters in the string:



    std::sort(str.begin(), str.end());


    And then run through removing adjacent duplicates.



    Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?



    Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.



    If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.



    Remove_if to the rescue



    There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:



    abcdeaafg_
    ^^
    abcdeaafg_
    ^ ^
    abcdefafg_
    ^ ^
    abcdefgfg_
    ^ ^
    abcdefg_


    But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:



    static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
    std::bitset<256> seenChars;
    auto newEnd = remove_if(str.begin(), str.end(),
    [&seenChars](auto & c) // capture seenChars so we can use it
    if(seenChars[c%256]) // have seen it before, so remove
    return true;
    seenChars[c%256] = true; // we have seen it now
    return false; // but we won't remove this one
    );


    (Edited this after replacing array with bitset)



    To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars] captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.



    The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.



    remove_if itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:



    str.resize(std::distance(str.begin(), newEnd));


    Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.



    By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.






    share|improve this answer















    Clarify the requirement



    You state 'remove duplicate characters' but then your code checks for upper and lower case matches - is the requirement to check for distinct letters, or distinct characters?



    Sort out the bugs



    I sense premature optimisation here, and it is the root of some evil:



    int len = str.length();
    for (int i = 0; i < len - 1; ++i)

    for (int j = i + 1; j < len - 1; ++j)

    if (str[i] == std::toupper(str[j])




    What is the value of len? The length of the original string. And what is the value of len after we have erased a duplicate character? Still the same. So if we do find some duplicates, we will then happily run out of bounds in both loops. You can manually update len, or you could just call length() - it is constant complexity so you aren't rescanning the string each time. I would opt for comparing to length() until a profiler tells you otherwise.



    Getting an idea of computational complexity



    The complexity of the pair of loops is O(n2). But inside that, you do an erase - imagine we erase from the middle of the string (which is the average case); then we will need to copy about n/2 characters into the previous positions. This gives us O(n3).



    There are things we can do to improve a bit - we can continue scanning for further duplicates before erasing, so that abccccccccd only required one copy, collecting all the duplicate c in one go.



    Without preserving order, we can do O(n log n)



    The easy way to improve the complexity is just to sort the characters in the string:



    std::sort(str.begin(), str.end());


    And then run through removing adjacent duplicates.



    Is it possible to remove duplicates with O(n log n) (or better) complexity whilst maintaining order?



    Consider the decision whether to elide some character halfway through the string. This character could be any of (assuming 1 byte chars) 256 values, so the bare minimum storage needed to make the decision is 256 bits. The requirements are not clear - 'one or two additional variables' could easily cover a 256 element std::bitset (should be 64 bytes), but could also not cover it if they are angling for a specific solution.



    If we can keep that std::bitset, then we can mark each byte value that we've seen in the array and erase any contiguous blocks of duplicates we see. Marking a value in a fixed size array is constant time (O(1)), so the cost of detecting whether each character is a duplicate is still only O(n). But if we erase them each time we find them, we will end up in O(n2) land again, copying the last character in the string into successive locations.



    Remove_if to the rescue



    There is a way to remove all those duplicates without copying each later character more than once; imagine two position markers, one to the character we are assessing, and one to an earlier position at the end of the deduplicated string. If we copy each character directly into it's final position then the space left by duplicate characters will bubble to the end:



    abcdeaafg_
    ^^
    abcdeaafg_
    ^ ^
    abcdefafg_
    ^ ^
    abcdefgfg_
    ^ ^
    abcdefg_


    But instead of having to cope with all the details, we can call remove_if, which does just that job for us, at a cost of n calls to the predicate. The predicate (boolean test function) will look up the character value in the bitset:



    static_assert((CHAR_MAX - CHAR_MIN <= 256), "Implementation requires 8 bit char");
    std::bitset<256> seenChars;
    auto newEnd = remove_if(str.begin(), str.end(),
    [&seenChars](auto & c) // capture seenChars so we can use it
    if(seenChars[c%256]) // have seen it before, so remove
    return true;
    seenChars[c%256] = true; // we have seen it now
    return false; // but we won't remove this one
    );


    (Edited this after replacing array with bitset)



    To unpack that a little: we are removing any character for which the predicate returns true. The predicate simply checks whether the character has been seen (by casting the character to int and using it modulo 256 as the index into the bitset), and either returns true or marks this character value for next time and returns false. The [&seenChars] captures the bitset so that we can use it inside the predicate without having to pass it in; it forms part of the context of the function body.



    The hardcoded 256 is not ideal, perhaps, although I wouldn't want to make it general to larger numbers (instead just failing a static_assert) because of the implication for storage.



    remove_if itself returns an iterator to the new end position of the string; it has copied the latter values into the earlier positions, but not resized the string. We resize, then:



    str.resize(std::distance(str.begin(), newEnd));


    Et voila, we have removed duplicates, in place, with a space requirement of 256 bits and a time complexity of O(n). Not too shabby.



    By the way, you can detect the removal of duplicates by comparing the original string length with the final string length. If they differ, duplicates were indeed found.







    share|improve this answer















    share|improve this answer



    share|improve this answer








    edited Aug 1 at 16:12


























    answered Jul 31 at 21:04









    Phil H

    38647




    38647







    • 1




      std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
      – Snowhawk
      Jul 31 at 21:55










    • @Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
      – Phil H
      Aug 1 at 6:00










    • std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for.
      – Snowhawk
      Aug 1 at 8:44










    • @Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
      – Phil H
      Aug 1 at 14:11










    • Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
      – DNK
      Aug 3 at 6:44












    • 1




      std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
      – Snowhawk
      Jul 31 at 21:55










    • @Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
      – Phil H
      Aug 1 at 6:00










    • std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for.
      – Snowhawk
      Aug 1 at 8:44










    • @Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
      – Phil H
      Aug 1 at 14:11










    • Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
      – DNK
      Aug 3 at 6:44







    1




    1




    std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
    – Snowhawk
    Jul 31 at 21:55




    std::array<bool, 256> seenChars; violates the only requirement laid out by the problem. You are also indexing with a signed value, in which a negative index would result in undefined behavior
    – Snowhawk
    Jul 31 at 21:55












    @Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
    – Phil H
    Aug 1 at 6:00




    @Snowhawk: Given the possible performance of O(n), and the fact that it's a fixed 64 bytes regardless of the size of the input, I'm not sure it contravenes the requirement. In an interview it would at least be worth discussing. I'll have a think about the signed int issue.
    – Phil H
    Aug 1 at 6:00












    std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for.
    – Snowhawk
    Aug 1 at 8:44




    std::array<bool, 256> isn't a fixed 64 bytes and there is no proxy specialization of std::array that bitpacks like std::vector<bool>. Assuming an implementation that defines sizeof(bool) == 1, the size of your container is 256 bytes, same as bool[256]. A bit-packing structure, like std::bitset, is likely what you are looking for.
    – Snowhawk
    Aug 1 at 8:44












    @Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
    – Phil H
    Aug 1 at 14:11




    @Snowhawk: Huh, completely assumed there would be a specialization. I suppose it would break compatibility with bool[n].
    – Phil H
    Aug 1 at 14:11












    Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
    – DNK
    Aug 3 at 6:44




    Hi, why don't you simply use a std::exchange for your seenChars bitset in your lambda? That would lead to one single line return std::exchange(seenChars[c%256], true);
    – DNK
    Aug 3 at 6:44












    up vote
    4
    down vote













    void removeDuplicates(std::string& str)


    Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm> library, we have std::remove/std::remove_if. Removing is described as follows:




    Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.




    Sounds exactly like what we're aiming for, but it doesn't actually call erase(). If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase() (see Erase-Remove Idiom).



    std::string::iterator removeDuplicates(std::string& str) ... 

    int main()
    auto s = std::string"aaaaaa";
    s.erase(removeDuplicates(s), s.end());
    std::cout << s; // Outputs "a"




     int flag = 0;


    C++ has a boolean type (bool).



    flag is a bad name. Names are a nifty way to express intent or purpose. Since flag is flagging if a duplicate exists, why not name it duplicate_exists?



     bool duplicate_exists = false;



     int len = str.length();
    for (int i = 0; i < len - 1; ++i) {
    for (int j = i + 1; j < len - 1; ++j)


    std::string::length() returns a size of type std::string::size_type (std::size_t typically). You have a conversion to int here (-Wconversion).



    Every time you call str.erase(), you are reducing the physical length of str by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:



    removeDuplicates("a") -> No Duplicate, "a"
    removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
    removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
    removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
    removeDuplicates("aaaaa") -> Segmentation Fault.


    If you simply just str.length() on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.




     if (str[i] == std::toupper(str[j])
    || str[i] == std::tolower(str[j]))


    "If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.




     if (flag == 0) 
    std::cout << "There are no duplicate charactersn";
    else
    std::cout << "There are duplicate charactersn";



    Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.






    share|improve this answer



























      up vote
      4
      down vote













      void removeDuplicates(std::string& str)


      Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm> library, we have std::remove/std::remove_if. Removing is described as follows:




      Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.




      Sounds exactly like what we're aiming for, but it doesn't actually call erase(). If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase() (see Erase-Remove Idiom).



      std::string::iterator removeDuplicates(std::string& str) ... 

      int main()
      auto s = std::string"aaaaaa";
      s.erase(removeDuplicates(s), s.end());
      std::cout << s; // Outputs "a"




       int flag = 0;


      C++ has a boolean type (bool).



      flag is a bad name. Names are a nifty way to express intent or purpose. Since flag is flagging if a duplicate exists, why not name it duplicate_exists?



       bool duplicate_exists = false;



       int len = str.length();
      for (int i = 0; i < len - 1; ++i) {
      for (int j = i + 1; j < len - 1; ++j)


      std::string::length() returns a size of type std::string::size_type (std::size_t typically). You have a conversion to int here (-Wconversion).



      Every time you call str.erase(), you are reducing the physical length of str by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:



      removeDuplicates("a") -> No Duplicate, "a"
      removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
      removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
      removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
      removeDuplicates("aaaaa") -> Segmentation Fault.


      If you simply just str.length() on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.




       if (str[i] == std::toupper(str[j])
      || str[i] == std::tolower(str[j]))


      "If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.




       if (flag == 0) 
      std::cout << "There are no duplicate charactersn";
      else
      std::cout << "There are duplicate charactersn";



      Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.






      share|improve this answer

























        up vote
        4
        down vote










        up vote
        4
        down vote









        void removeDuplicates(std::string& str)


        Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm> library, we have std::remove/std::remove_if. Removing is described as follows:




        Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.




        Sounds exactly like what we're aiming for, but it doesn't actually call erase(). If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase() (see Erase-Remove Idiom).



        std::string::iterator removeDuplicates(std::string& str) ... 

        int main()
        auto s = std::string"aaaaaa";
        s.erase(removeDuplicates(s), s.end());
        std::cout << s; // Outputs "a"




         int flag = 0;


        C++ has a boolean type (bool).



        flag is a bad name. Names are a nifty way to express intent or purpose. Since flag is flagging if a duplicate exists, why not name it duplicate_exists?



         bool duplicate_exists = false;



         int len = str.length();
        for (int i = 0; i < len - 1; ++i) {
        for (int j = i + 1; j < len - 1; ++j)


        std::string::length() returns a size of type std::string::size_type (std::size_t typically). You have a conversion to int here (-Wconversion).



        Every time you call str.erase(), you are reducing the physical length of str by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:



        removeDuplicates("a") -> No Duplicate, "a"
        removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
        removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
        removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
        removeDuplicates("aaaaa") -> Segmentation Fault.


        If you simply just str.length() on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.




         if (str[i] == std::toupper(str[j])
        || str[i] == std::tolower(str[j]))


        "If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.




         if (flag == 0) 
        std::cout << "There are no duplicate charactersn";
        else
        std::cout << "There are duplicate charactersn";



        Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.






        share|improve this answer















        void removeDuplicates(std::string& str)


        Are we removing elements or erasing elements? The function name tells me one thing, the actual algorithm does the other. In the <algorithm> library, we have std::remove/std::remove_if. Removing is described as follows:




        Removing is done by shifting (by means of move assignment) the elements in the range in such a way that the elements that are not to be removed appear in the beginning of the range. Relative order of the elements that remain is preserved and the physical size of the container is unchanged. Iterators pointing to an element between the new logical end and the physical end of the range are still dereferenceable, but the elements themselves have unspecified values (as per MoveAssignable post-condition). A call to remove is typically followed by a call to a container's erase method, which erases the unspecified values and reduces the physical size of the container to match its new logical size.




        Sounds exactly like what we're aiming for, but it doesn't actually call erase(). If your algorithm is going to be designed to remove elements, then just remove elements. Return an iterator to one-past the last element moved and let the caller invoke erase() (see Erase-Remove Idiom).



        std::string::iterator removeDuplicates(std::string& str) ... 

        int main()
        auto s = std::string"aaaaaa";
        s.erase(removeDuplicates(s), s.end());
        std::cout << s; // Outputs "a"




         int flag = 0;


        C++ has a boolean type (bool).



        flag is a bad name. Names are a nifty way to express intent or purpose. Since flag is flagging if a duplicate exists, why not name it duplicate_exists?



         bool duplicate_exists = false;



         int len = str.length();
        for (int i = 0; i < len - 1; ++i) {
        for (int j = i + 1; j < len - 1; ++j)


        std::string::length() returns a size of type std::string::size_type (std::size_t typically). You have a conversion to int here (-Wconversion).



        Every time you call str.erase(), you are reducing the physical length of str by 1. However, your loop exit conditions use a cached value of the original length. Consider the following inputs:



        removeDuplicates("a") -> No Duplicate, "a"
        removeDuplicates("aa") -> No Duplicate, "aa" << Both Wrong.
        removeDuplicates("aaa") -> Duplicate, "aa" << String is wrong.
        removeDuplicates("aaaa") -> Duplicate, "aa" << String is wrong.
        removeDuplicates("aaaaa") -> Segmentation Fault.


        If you simply just str.length() on every iteration, you won't have the segfault issue, but you'll still have logic issues in generating a deduplicated string. The standard algorithms use the shifting method to avoid the extra logic required to deal with erase and string lengths.




         if (str[i] == std::toupper(str[j])
        || str[i] == std::tolower(str[j]))


        "If a necessary feature has a high astonishment factor, it may be necessary to redesign the feature" (Principle of least astonishment).No where in the problem statement, in your understanding of the problem, function names, or in comments is there a mention that this is case-insensitive. For this problem, I would expect $A neq a$. Follow the requirements of the problem and document changes like this.




         if (flag == 0) 
        std::cout << "There are no duplicate charactersn";
        else
        std::cout << "There are duplicate charactersn";



        Should this function be writing to the console? Following the single responsibility principle. This function should focus on removing duplicates and another function, like the caller, can report to the console.







        share|improve this answer















        share|improve this answer



        share|improve this answer








        edited Aug 1 at 8:41


























        answered Aug 1 at 7:58









        Snowhawk

        3,9401925




        3,9401925






















             

            draft saved


            draft discarded


























             


            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f200682%2fremove-duplicate-characters-in-a-string%23new-answer', 'question_page');

            );

            Post as a guest













































































            Popular posts from this blog

            Greedy Best First Search implementation in Rust

            Function to Return a JSON Like Objects Using VBA Collections and Arrays

            C++11 CLH Lock Implementation