Determine if one string is a permutation of the another

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;

up vote
12
down vote

favorite

Can anyone give me some feedback if I'm using std::unordered_multimap optimally? I don't know the ins-and-outs of the std::unordered_multimap implementation.

I'm wondering if comparing two maps iterates through each of the elements in the map, making the algorithm $O(N^2)$ or if it just compares the hash of the map or something making the algorithm $O(N)$. Or would it just be better to use an array if I'm given this question in an interview?

bool is_permutation(const std::string& s1, const std::string& s2)

 assert(s1.size() && s2.size());

 if(s1.length() != s2.length())
 return false;

 std::unordered_multimap<int, char> m1, m2;

 for(int i = 0; i < s1.length(); ++i)
 
 m1.insert(s1[i], s1[i]);
 m2.insert(s2[i], s2[i]);
 

 if(m1 == m2)
 return true;

 return false;

edited Apr 18 at 23:38

Jamalâ™¦

30.1k11114225

asked Apr 17 at 20:20

John DeBord

634

Fun fact: this problem (of determining if a string is a permutation of another) is the classic example of a problem that is inherently faster in imperative languages such as C++ than declarative languages such as SML where lists are defined recursively as head and tail: O(n) in C++ while the best possible in SML is O(n log n).
â€“Â Pedro A
Apr 18 at 0:05

You are assuming that the string contains single-byte characters only. It can get false positives with UTF-8 text.
â€“Â JDÃ…Â‚ugosz
Apr 18 at 6:14

4

Sort the characters in both strings and compare them?
â€“Â ThorbjÃ¸rn Ravn Andersen
Apr 18 at 8:56

Is there a good reason not to simply return std::is_permutation(s1.begin(), s1.end(), s2.begin(), s2.end());? Or should this be tagged reinventing-the-wheel?
â€“Â Toby Speight
Apr 23 at 19:14

add a commentÂ |Â

up vote
12
down vote

favorite

Can anyone give me some feedback if I'm using std::unordered_multimap optimally? I don't know the ins-and-outs of the std::unordered_multimap implementation.

bool is_permutation(const std::string& s1, const std::string& s2)

 assert(s1.size() && s2.size());

 if(s1.length() != s2.length())
 return false;

 std::unordered_multimap<int, char> m1, m2;

 for(int i = 0; i < s1.length(); ++i)
 
 m1.insert(s1[i], s1[i]);
 m2.insert(s2[i], s2[i]);
 

 if(m1 == m2)
 return true;

 return false;

edited Apr 18 at 23:38

Jamalâ™¦

30.1k11114225

asked Apr 17 at 20:20

John DeBord

634

Fun fact: this problem (of determining if a string is a permutation of another) is the classic example of a problem that is inherently faster in imperative languages such as C++ than declarative languages such as SML where lists are defined recursively as head and tail: O(n) in C++ while the best possible in SML is O(n log n).
â€“Â Pedro A
Apr 18 at 0:05

You are assuming that the string contains single-byte characters only. It can get false positives with UTF-8 text.
â€“Â JDÃ…Â‚ugosz
Apr 18 at 6:14

4

Sort the characters in both strings and compare them?
â€“Â ThorbjÃ¸rn Ravn Andersen
Apr 18 at 8:56

Is there a good reason not to simply return std::is_permutation(s1.begin(), s1.end(), s2.begin(), s2.end());? Or should this be tagged reinventing-the-wheel?
â€“Â Toby Speight
Apr 23 at 19:14

add a commentÂ |Â

up vote
12
down vote

favorite

Can anyone give me some feedback if I'm using std::unordered_multimap optimally? I don't know the ins-and-outs of the std::unordered_multimap implementation.

bool is_permutation(const std::string& s1, const std::string& s2)

 assert(s1.size() && s2.size());

 if(s1.length() != s2.length())
 return false;

 std::unordered_multimap<int, char> m1, m2;

 for(int i = 0; i < s1.length(); ++i)
 
 m1.insert(s1[i], s1[i]);
 m2.insert(s2[i], s2[i]);
 

 if(m1 == m2)
 return true;

 return false;

edited Apr 18 at 23:38

Jamalâ™¦

30.1k11114225

asked Apr 17 at 20:20

John DeBord

634

Can anyone give me some feedback if I'm using std::unordered_multimap optimally? I don't know the ins-and-outs of the std::unordered_multimap implementation.

bool is_permutation(const std::string& s1, const std::string& s2)

 assert(s1.size() && s2.size());

 if(s1.length() != s2.length())
 return false;

 std::unordered_multimap<int, char> m1, m2;

 for(int i = 0; i < s1.length(); ++i)
 
 m1.insert(s1[i], s1[i]);
 m2.insert(s2[i], s2[i]);
 

 if(m1 == m2)
 return true;

 return false;

edited Apr 18 at 23:38

Jamalâ™¦

30.1k11114225

asked Apr 17 at 20:20

John DeBord

634

edited Apr 18 at 23:38

Jamalâ™¦

30.1k11114225

edited Apr 18 at 23:38

Jamalâ™¦

30.1k11114225

edited Apr 18 at 23:38

Jamalâ™¦

30.1k11114225

asked Apr 17 at 20:20

John DeBord

634

asked Apr 17 at 20:20

John DeBord

634

asked Apr 17 at 20:20

John DeBord

634

Fun fact: this problem (of determining if a string is a permutation of another) is the classic example of a problem that is inherently faster in imperative languages such as C++ than declarative languages such as SML where lists are defined recursively as head and tail: O(n) in C++ while the best possible in SML is O(n log n).
â€“Â Pedro A
Apr 18 at 0:05

You are assuming that the string contains single-byte characters only. It can get false positives with UTF-8 text.
â€“Â JDÃ…Â‚ugosz
Apr 18 at 6:14

4

Sort the characters in both strings and compare them?
â€“Â ThorbjÃ¸rn Ravn Andersen
Apr 18 at 8:56

Is there a good reason not to simply return std::is_permutation(s1.begin(), s1.end(), s2.begin(), s2.end());? Or should this be tagged reinventing-the-wheel?
â€“Â Toby Speight
Apr 23 at 19:14

add a commentÂ |Â

Fun fact: this problem (of determining if a string is a permutation of another) is the classic example of a problem that is inherently faster in imperative languages such as C++ than declarative languages such as SML where lists are defined recursively as head and tail: O(n) in C++ while the best possible in SML is O(n log n).
â€“Â Pedro A
Apr 18 at 0:05

You are assuming that the string contains single-byte characters only. It can get false positives with UTF-8 text.
â€“Â JDÃ…Â‚ugosz
Apr 18 at 6:14

4

Sort the characters in both strings and compare them?
â€“Â ThorbjÃ¸rn Ravn Andersen
Apr 18 at 8:56

Is there a good reason not to simply return std::is_permutation(s1.begin(), s1.end(), s2.begin(), s2.end());? Or should this be tagged reinventing-the-wheel?
â€“Â Toby Speight
Apr 23 at 19:14

Fun fact: this problem (of determining if a string is a permutation of another) is the classic example of a problem that is inherently faster in imperative languages such as C++ than declarative languages such as SML where lists are defined recursively as head and tail: O(n) in C++ while the best possible in SML is O(n log n).
â€“Â Pedro A
Apr 18 at 0:05

You are assuming that the string contains single-byte characters only. It can get false positives with UTF-8 text.
â€“Â JDÃ…Â‚ugosz
Apr 18 at 6:14

Sort the characters in both strings and compare them?
â€“Â ThorbjÃ¸rn Ravn Andersen
Apr 18 at 8:56

Is there a good reason not to simply return std::is_permutation(s1.begin(), s1.end(), s2.begin(), s2.end());? Or should this be tagged reinventing-the-wheel?
â€“Â Toby Speight
Apr 23 at 19:14

add a commentÂ |Â

3 Answers
3

active

oldest

votes

up vote
14
down vote

accepted

Comparing two maps is linear, so that's not a big problem.

I don't think you've made particularly good use of maps though. As I see it, you have two choices. You could use an unordered_multiset, or you could use a unordered_map<char, size_t>. In the latter case, you'd keep a count of each character in the string, and increment the count for each character as you scan through the input strings.

Personally, I think it's generally simpler to just sort the two strings, then compare them:

bool is_permutation(std::string s1, std::string s2) 
 if (s1.length() != s2.length())
 return false;

 std::sort(s1.begin(), s1.end());
 std::sort(s2.begin(), s2.end());
 return s1 == s2;

A version making good use of an unordered_map:

bool is_perm2(std::string const &a, std::string const &b) 
 std::unordered_map<char, std::size_t> s1;
 std::unordered_map<char, std::size_t> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

...will gain a substantial advantage if the strings involved are long. Don't expect miracles though--on typical strings (a few dozen to a few hundred characters) overhead often matters more than computational complexity, so this is unlikely to accomplish much. On the other hand, if you honestly expect to process strings of (say) a megabyte or more on a fairly regular basis, you can expect it to run significantly faster than the others.

I have not, however, gotten nearly as good of results from using a multiset or unordered_multiset. In fact, both are quite slow in my testing. I suspect the problem is that for this case, they're both storing a lot of data (a node for each character in the input), so before the string is large enough for computational complexity to matter much, they're overflowing the cache, so you end up with a lot of references to main memory. Note that when used as in the question, std::unordered_multimap ends up similarly (or probably even worse).

For anybody who might care, here's a chart of how various approaches work out for different sizes of strings:

enter image description here

[This shows the log of the time on the vertical axis, to help keep things visible across a wide range of sizes (and with them, times).]

So this shows the unordered_map having pretty high overhead, but nice, slow growth, so it's the slowest for small strings, but the fastest for strings over 8K in size.

Sorting is somewhat similar--faster for strings up to a kilobyte, but slower than the unordered_map for larger strings.

Using a raw array for the counts has behavior similar to sort, but even more dramatic--very fast for small strings, but slows down quite a lot for larger strings.

There seems to be no size at which the set/unordered_set (or unordered_multimap as used in the question) be the optimal choice.

Oh, for anybody who wonders about the odd stair-step look: I generated two tests for strings of each length: one with the strings identical, one with one character changed between them (i.e., one that should return true and another than should return false). I suppose if I wanted to be really complete, I should also add a test case for strings that are shuffled versions of each other, so they are unequal, but should still return true, but I haven't bothered with that (I have tested to be reasonably certain all the charted algorithms actually work though).

Edit 2:

After more testing, I have to agree with the consensus that counting characters in an array is a better choice:

bool is_perm3(std::string const &a, std::string const &b) 
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s1;
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 for (std::size_t i=0; i<s1.size(); i++)
 if (s1[i] != s2[i])
 return false;
 return true;

I remain puzzled (and frankly, rather bothered) but the poor performance of std::vector vs. std::array in his test. We'd expect array to have less initial overhead, but the vector was pre-allocated, so speed differences after being set up should be quite minimal (but in this case definitely are not).

edited Apr 18 at 16:40

answered Apr 17 at 20:26

Jerry Coffin

27.4k360123

Can you explain to me why it is O(NlogN)? I don't understand how my solution is O(NlogN) as well as yours?
â€“Â John DeBord
Apr 17 at 20:30

@JohnDeBord: Sorry--I hadn't read carefully enough (or maybe just my brain slipped a cog), and thought you were using a multimap instead of unordered_multimap.
â€“Â Jerry Coffin
Apr 17 at 20:31

5

Sorting seems like an overkill. Count letters and compare histograms in $O(N)$.
â€“Â vnp
Apr 17 at 21:00

3

@vnp: I am betting that sorting is the quickest for short strings.
â€“Â Martin York
Apr 17 at 23:26

@JerryCoffin Surprised with array results. Did you just replace the unordered_map with C-Array?
â€“Â Martin York
Apr 17 at 23:30

Â |Â
show 7 more comments

up vote
15
down vote

I would say yes, use an array.

Arrays afford very good memory locality, which is good for cache performance. They essentially are the most basic and efficient hash map if the key is the index.

You only need one array with 256 elements: first count up the number of times each character appears in the first string, and then count down for the number of times each character appears in the second string. Precisely if the two strings are permutations, the array will finish with all zeroes.

As a further hint at relative complexity, according to godbolt.org, your code produces 849 lines of assembly when compiled on the most recent GCC with -O3 optimisation.
The following code produces 53 lines.

bool is_permutation(const std::string& s1, const std::string& s2)

 if(s1.length() != s2.length())
 return false;

 // There are usually 256 possible characters but on rare architectures may be more.
 constexpr possible_char_count = std::numeric_limits<unsigned char>::max()+1;
 std::array<size_t, possible_char_count > m1 = ; // Zero initialized all places

 for(char c : s1) // Count up for each character in the first string
 m1[static_cast<unsigned char>(c)]++; 
 
 for(char c : s2) // Count down for each character in the second string
 m1[static_cast<unsigned char>(c)]--; 
 

 for (size_t t : m1) 
 // If any character appears a net nonzero number of times,
 // the two strings are not anagrams.
 if (t != 0) 
 return false;
 
 
 return true;

Lines of code are not, of themselves, a perfect indicator of what is the most efficient algorithm. After all, you can implement a nice short version of bubble sort! Nevertheless when the underlying approach is the same (that of counting characters) it is a fairly good indicator that one implementation is doing unnecessary work.

One reason to prefer a map to an array is that when the input strings are very short or very homogenous, then the map is likely to be more memory efficient. This may translate into speed improvements by avoiding allocation and inialisation overhead. As always, the only reliable way to decide is to profile with your real world inputs.

edited Apr 18 at 22:04

answered Apr 17 at 23:24

Josiah

3,182326

3

Note that 1 << CHAR_BIT is not guaranteed to be 256.
â€“Â wchargin
Apr 18 at 7:48

I would recommend using something else than long long int: either assert that the strings are short-ish (< 2GB each) and use std::int32_t or aim for space and use std::int64_t. The latter, of course, incurs a significant memory overhead.
â€“Â Matthieu M.
Apr 18 at 13:12

1

For those following at home; the assembly of the long long int and int versions => godbolt.org/g/pGpHFt . As expected (1) the initialization and final checks are optimized and (2) the int versions requires only half as many iterations.
â€“Â Matthieu M.
Apr 18 at 13:18

add a commentÂ |Â

up vote
6
down vote

You have got the basics.

Though @Jerry Coffin points out some good points.

I would take this one step further.

A std::unordered_multimap is good for the general case. Where the thing you are finding and comparing can be arbitrarily complex. But here the thing we are comparing is a char. Very simple and a very small number of them (even if you use all 256).

So I would change to use an array of integers. One integer for each character. Though lets use a vector do easier comparisons.

// Stealing @Jerry Coffin code
bool is_perm2(std::string const &a, std::string const &b) 

 if (a.length() != b.length())
 return false;

 std::array<int, 1 + UCHAR_MAX> s1;
 std::array<int, 1 + UCHAR_MAX> s2;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

Notes:

Complexity O(n).

Overhead per iteration very small.

Overhead high for initialization.

edited Apr 18 at 16:58

answered Apr 17 at 23:23

Martin York

70.8k481244

2

What's the significance of casting unsigned char to unsigned char? Isn't that just an identity? For real portability, I'd write 1 + UCHAR_MAX rather than 256,
â€“Â Toby Speight
Apr 18 at 9:27

Shouldn't that be char or preferable auto for c?
â€“Â Deduplicator
Apr 18 at 16:21

add a commentÂ |Â

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f192327%2fdetermine-if-one-string-is-a-permutation-of-the-another%23new-answer', 'question_page');

);

Post as a guest

Name

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
14
down vote

accepted

Comparing two maps is linear, so that's not a big problem.

Personally, I think it's generally simpler to just sort the two strings, then compare them:

bool is_permutation(std::string s1, std::string s2) 
 if (s1.length() != s2.length())
 return false;

 std::sort(s1.begin(), s1.end());
 std::sort(s2.begin(), s2.end());
 return s1 == s2;

A version making good use of an unordered_map:

bool is_perm2(std::string const &a, std::string const &b) 
 std::unordered_map<char, std::size_t> s1;
 std::unordered_map<char, std::size_t> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

For anybody who might care, here's a chart of how various approaches work out for different sizes of strings:

enter image description here

[This shows the log of the time on the vertical axis, to help keep things visible across a wide range of sizes (and with them, times).]

So this shows the unordered_map having pretty high overhead, but nice, slow growth, so it's the slowest for small strings, but the fastest for strings over 8K in size.

Sorting is somewhat similar--faster for strings up to a kilobyte, but slower than the unordered_map for larger strings.

Using a raw array for the counts has behavior similar to sort, but even more dramatic--very fast for small strings, but slows down quite a lot for larger strings.

There seems to be no size at which the set/unordered_set (or unordered_multimap as used in the question) be the optimal choice.

Edit 2:

After more testing, I have to agree with the consensus that counting characters in an array is a better choice:

bool is_perm3(std::string const &a, std::string const &b) 
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s1;
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 for (std::size_t i=0; i<s1.size(); i++)
 if (s1[i] != s2[i])
 return false;
 return true;

edited Apr 18 at 16:40

answered Apr 17 at 20:26

Jerry Coffin

27.4k360123

Can you explain to me why it is O(NlogN)? I don't understand how my solution is O(NlogN) as well as yours?
â€“Â John DeBord
Apr 17 at 20:30

@JohnDeBord: Sorry--I hadn't read carefully enough (or maybe just my brain slipped a cog), and thought you were using a multimap instead of unordered_multimap.
â€“Â Jerry Coffin
Apr 17 at 20:31

5

Sorting seems like an overkill. Count letters and compare histograms in $O(N)$.
â€“Â vnp
Apr 17 at 21:00

3

@vnp: I am betting that sorting is the quickest for short strings.
â€“Â Martin York
Apr 17 at 23:26

@JerryCoffin Surprised with array results. Did you just replace the unordered_map with C-Array?
â€“Â Martin York
Apr 17 at 23:30

Â |Â
show 7 more comments

up vote
14
down vote

accepted

Comparing two maps is linear, so that's not a big problem.

Personally, I think it's generally simpler to just sort the two strings, then compare them:

bool is_permutation(std::string s1, std::string s2) 
 if (s1.length() != s2.length())
 return false;

 std::sort(s1.begin(), s1.end());
 std::sort(s2.begin(), s2.end());
 return s1 == s2;

A version making good use of an unordered_map:

bool is_perm2(std::string const &a, std::string const &b) 
 std::unordered_map<char, std::size_t> s1;
 std::unordered_map<char, std::size_t> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

For anybody who might care, here's a chart of how various approaches work out for different sizes of strings:

enter image description here

[This shows the log of the time on the vertical axis, to help keep things visible across a wide range of sizes (and with them, times).]

So this shows the unordered_map having pretty high overhead, but nice, slow growth, so it's the slowest for small strings, but the fastest for strings over 8K in size.

Sorting is somewhat similar--faster for strings up to a kilobyte, but slower than the unordered_map for larger strings.

Using a raw array for the counts has behavior similar to sort, but even more dramatic--very fast for small strings, but slows down quite a lot for larger strings.

There seems to be no size at which the set/unordered_set (or unordered_multimap as used in the question) be the optimal choice.

Edit 2:

After more testing, I have to agree with the consensus that counting characters in an array is a better choice:

bool is_perm3(std::string const &a, std::string const &b) 
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s1;
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 for (std::size_t i=0; i<s1.size(); i++)
 if (s1[i] != s2[i])
 return false;
 return true;

edited Apr 18 at 16:40

answered Apr 17 at 20:26

Jerry Coffin

27.4k360123

Can you explain to me why it is O(NlogN)? I don't understand how my solution is O(NlogN) as well as yours?
â€“Â John DeBord
Apr 17 at 20:30

@JohnDeBord: Sorry--I hadn't read carefully enough (or maybe just my brain slipped a cog), and thought you were using a multimap instead of unordered_multimap.
â€“Â Jerry Coffin
Apr 17 at 20:31

5

Sorting seems like an overkill. Count letters and compare histograms in $O(N)$.
â€“Â vnp
Apr 17 at 21:00

3

@vnp: I am betting that sorting is the quickest for short strings.
â€“Â Martin York
Apr 17 at 23:26

@JerryCoffin Surprised with array results. Did you just replace the unordered_map with C-Array?
â€“Â Martin York
Apr 17 at 23:30

Â |Â
show 7 more comments

up vote
14
down vote

accepted

Comparing two maps is linear, so that's not a big problem.

Personally, I think it's generally simpler to just sort the two strings, then compare them:

bool is_permutation(std::string s1, std::string s2) 
 if (s1.length() != s2.length())
 return false;

 std::sort(s1.begin(), s1.end());
 std::sort(s2.begin(), s2.end());
 return s1 == s2;

A version making good use of an unordered_map:

bool is_perm2(std::string const &a, std::string const &b) 
 std::unordered_map<char, std::size_t> s1;
 std::unordered_map<char, std::size_t> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

For anybody who might care, here's a chart of how various approaches work out for different sizes of strings:

enter image description here

[This shows the log of the time on the vertical axis, to help keep things visible across a wide range of sizes (and with them, times).]

So this shows the unordered_map having pretty high overhead, but nice, slow growth, so it's the slowest for small strings, but the fastest for strings over 8K in size.

Sorting is somewhat similar--faster for strings up to a kilobyte, but slower than the unordered_map for larger strings.

Using a raw array for the counts has behavior similar to sort, but even more dramatic--very fast for small strings, but slows down quite a lot for larger strings.

There seems to be no size at which the set/unordered_set (or unordered_multimap as used in the question) be the optimal choice.

Edit 2:

After more testing, I have to agree with the consensus that counting characters in an array is a better choice:

bool is_perm3(std::string const &a, std::string const &b) 
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s1;
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 for (std::size_t i=0; i<s1.size(); i++)
 if (s1[i] != s2[i])
 return false;
 return true;

edited Apr 18 at 16:40

answered Apr 17 at 20:26

Jerry Coffin

27.4k360123

Comparing two maps is linear, so that's not a big problem.

Personally, I think it's generally simpler to just sort the two strings, then compare them:

bool is_permutation(std::string s1, std::string s2) 
 if (s1.length() != s2.length())
 return false;

 std::sort(s1.begin(), s1.end());
 std::sort(s2.begin(), s2.end());
 return s1 == s2;

A version making good use of an unordered_map:

bool is_perm2(std::string const &a, std::string const &b) 
 std::unordered_map<char, std::size_t> s1;
 std::unordered_map<char, std::size_t> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

For anybody who might care, here's a chart of how various approaches work out for different sizes of strings:

enter image description here

[This shows the log of the time on the vertical axis, to help keep things visible across a wide range of sizes (and with them, times).]

So this shows the unordered_map having pretty high overhead, but nice, slow growth, so it's the slowest for small strings, but the fastest for strings over 8K in size.

Sorting is somewhat similar--faster for strings up to a kilobyte, but slower than the unordered_map for larger strings.

Using a raw array for the counts has behavior similar to sort, but even more dramatic--very fast for small strings, but slows down quite a lot for larger strings.

There seems to be no size at which the set/unordered_set (or unordered_multimap as used in the question) be the optimal choice.

Edit 2:

After more testing, I have to agree with the consensus that counting characters in an array is a better choice:

bool is_perm3(std::string const &a, std::string const &b) 
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s1;
 std::array<std::size_t, std::numeric_limits<unsigned char>::max()+1> s2;

 if (a.length() != b.length())
 return false;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 for (std::size_t i=0; i<s1.size(); i++)
 if (s1[i] != s2[i])
 return false;
 return true;

edited Apr 18 at 16:40

answered Apr 17 at 20:26

Jerry Coffin

27.4k360123

edited Apr 18 at 16:40

answered Apr 17 at 20:26

Jerry Coffin

27.4k360123

answered Apr 17 at 20:26

Jerry Coffin

27.4k360123

answered Apr 17 at 20:26

Jerry Coffin

27.4k360123

Can you explain to me why it is O(NlogN)? I don't understand how my solution is O(NlogN) as well as yours?
â€“Â John DeBord
Apr 17 at 20:30

@JohnDeBord: Sorry--I hadn't read carefully enough (or maybe just my brain slipped a cog), and thought you were using a multimap instead of unordered_multimap.
â€“Â Jerry Coffin
Apr 17 at 20:31

5

Sorting seems like an overkill. Count letters and compare histograms in $O(N)$.
â€“Â vnp
Apr 17 at 21:00

3

@vnp: I am betting that sorting is the quickest for short strings.
â€“Â Martin York
Apr 17 at 23:26

@JerryCoffin Surprised with array results. Did you just replace the unordered_map with C-Array?
â€“Â Martin York
Apr 17 at 23:30

Â |Â
show 7 more comments

Can you explain to me why it is O(NlogN)? I don't understand how my solution is O(NlogN) as well as yours?
â€“Â John DeBord
Apr 17 at 20:30

@JohnDeBord: Sorry--I hadn't read carefully enough (or maybe just my brain slipped a cog), and thought you were using a multimap instead of unordered_multimap.
â€“Â Jerry Coffin
Apr 17 at 20:31

5

Sorting seems like an overkill. Count letters and compare histograms in $O(N)$.
â€“Â vnp
Apr 17 at 21:00

3

@vnp: I am betting that sorting is the quickest for short strings.
â€“Â Martin York
Apr 17 at 23:26

@JerryCoffin Surprised with array results. Did you just replace the unordered_map with C-Array?
â€“Â Martin York
Apr 17 at 23:30

Can you explain to me why it is O(NlogN)? I don't understand how my solution is O(NlogN) as well as yours?
â€“Â John DeBord
Apr 17 at 20:30

@JohnDeBord: Sorry--I hadn't read carefully enough (or maybe just my brain slipped a cog), and thought you were using a multimap instead of unordered_multimap.
â€“Â Jerry Coffin
Apr 17 at 20:31

Sorting seems like an overkill. Count letters and compare histograms in $O(N)$.
â€“Â vnp
Apr 17 at 21:00

@vnp: I am betting that sorting is the quickest for short strings.
â€“Â Martin York
Apr 17 at 23:26

@JerryCoffin Surprised with array results. Did you just replace the unordered_map with C-Array?
â€“Â Martin York
Apr 17 at 23:30

Â |Â
show 7 more comments

up vote
15
down vote

I would say yes, use an array.

Arrays afford very good memory locality, which is good for cache performance. They essentially are the most basic and efficient hash map if the key is the index.

bool is_permutation(const std::string& s1, const std::string& s2)

 if(s1.length() != s2.length())
 return false;

 // There are usually 256 possible characters but on rare architectures may be more.
 constexpr possible_char_count = std::numeric_limits<unsigned char>::max()+1;
 std::array<size_t, possible_char_count > m1 = ; // Zero initialized all places

 for(char c : s1) // Count up for each character in the first string
 m1[static_cast<unsigned char>(c)]++; 
 
 for(char c : s2) // Count down for each character in the second string
 m1[static_cast<unsigned char>(c)]--; 
 

 for (size_t t : m1) 
 // If any character appears a net nonzero number of times,
 // the two strings are not anagrams.
 if (t != 0) 
 return false;
 
 
 return true;

edited Apr 18 at 22:04

answered Apr 17 at 23:24

Josiah

3,182326

3

Note that 1 << CHAR_BIT is not guaranteed to be 256.
â€“Â wchargin
Apr 18 at 7:48

I would recommend using something else than long long int: either assert that the strings are short-ish (< 2GB each) and use std::int32_t or aim for space and use std::int64_t. The latter, of course, incurs a significant memory overhead.
â€“Â Matthieu M.
Apr 18 at 13:12

1

For those following at home; the assembly of the long long int and int versions => godbolt.org/g/pGpHFt . As expected (1) the initialization and final checks are optimized and (2) the int versions requires only half as many iterations.
â€“Â Matthieu M.
Apr 18 at 13:18

add a commentÂ |Â

up vote
15
down vote

I would say yes, use an array.

Arrays afford very good memory locality, which is good for cache performance. They essentially are the most basic and efficient hash map if the key is the index.

bool is_permutation(const std::string& s1, const std::string& s2)

 if(s1.length() != s2.length())
 return false;

 // There are usually 256 possible characters but on rare architectures may be more.
 constexpr possible_char_count = std::numeric_limits<unsigned char>::max()+1;
 std::array<size_t, possible_char_count > m1 = ; // Zero initialized all places

 for(char c : s1) // Count up for each character in the first string
 m1[static_cast<unsigned char>(c)]++; 
 
 for(char c : s2) // Count down for each character in the second string
 m1[static_cast<unsigned char>(c)]--; 
 

 for (size_t t : m1) 
 // If any character appears a net nonzero number of times,
 // the two strings are not anagrams.
 if (t != 0) 
 return false;
 
 
 return true;

edited Apr 18 at 22:04

answered Apr 17 at 23:24

Josiah

3,182326

3

Note that 1 << CHAR_BIT is not guaranteed to be 256.
â€“Â wchargin
Apr 18 at 7:48

I would recommend using something else than long long int: either assert that the strings are short-ish (< 2GB each) and use std::int32_t or aim for space and use std::int64_t. The latter, of course, incurs a significant memory overhead.
â€“Â Matthieu M.
Apr 18 at 13:12

1

For those following at home; the assembly of the long long int and int versions => godbolt.org/g/pGpHFt . As expected (1) the initialization and final checks are optimized and (2) the int versions requires only half as many iterations.
â€“Â Matthieu M.
Apr 18 at 13:18

add a commentÂ |Â

up vote
15
down vote

I would say yes, use an array.

Arrays afford very good memory locality, which is good for cache performance. They essentially are the most basic and efficient hash map if the key is the index.

bool is_permutation(const std::string& s1, const std::string& s2)

 if(s1.length() != s2.length())
 return false;

 // There are usually 256 possible characters but on rare architectures may be more.
 constexpr possible_char_count = std::numeric_limits<unsigned char>::max()+1;
 std::array<size_t, possible_char_count > m1 = ; // Zero initialized all places

 for(char c : s1) // Count up for each character in the first string
 m1[static_cast<unsigned char>(c)]++; 
 
 for(char c : s2) // Count down for each character in the second string
 m1[static_cast<unsigned char>(c)]--; 
 

 for (size_t t : m1) 
 // If any character appears a net nonzero number of times,
 // the two strings are not anagrams.
 if (t != 0) 
 return false;
 
 
 return true;

edited Apr 18 at 22:04

answered Apr 17 at 23:24

Josiah

3,182326

I would say yes, use an array.

Arrays afford very good memory locality, which is good for cache performance. They essentially are the most basic and efficient hash map if the key is the index.

bool is_permutation(const std::string& s1, const std::string& s2)

 if(s1.length() != s2.length())
 return false;

 // There are usually 256 possible characters but on rare architectures may be more.
 constexpr possible_char_count = std::numeric_limits<unsigned char>::max()+1;
 std::array<size_t, possible_char_count > m1 = ; // Zero initialized all places

 for(char c : s1) // Count up for each character in the first string
 m1[static_cast<unsigned char>(c)]++; 
 
 for(char c : s2) // Count down for each character in the second string
 m1[static_cast<unsigned char>(c)]--; 
 

 for (size_t t : m1) 
 // If any character appears a net nonzero number of times,
 // the two strings are not anagrams.
 if (t != 0) 
 return false;
 
 
 return true;

edited Apr 18 at 22:04

answered Apr 17 at 23:24

Josiah

3,182326

edited Apr 18 at 22:04

answered Apr 17 at 23:24

Josiah

3,182326

answered Apr 17 at 23:24

Josiah

3,182326

answered Apr 17 at 23:24

Josiah

3,182326

3

Note that 1 << CHAR_BIT is not guaranteed to be 256.
â€“Â wchargin
Apr 18 at 7:48

I would recommend using something else than long long int: either assert that the strings are short-ish (< 2GB each) and use std::int32_t or aim for space and use std::int64_t. The latter, of course, incurs a significant memory overhead.
â€“Â Matthieu M.
Apr 18 at 13:12

1

For those following at home; the assembly of the long long int and int versions => godbolt.org/g/pGpHFt . As expected (1) the initialization and final checks are optimized and (2) the int versions requires only half as many iterations.
â€“Â Matthieu M.
Apr 18 at 13:18

add a commentÂ |Â

3

Note that 1 << CHAR_BIT is not guaranteed to be 256.
â€“Â wchargin
Apr 18 at 7:48

I would recommend using something else than long long int: either assert that the strings are short-ish (< 2GB each) and use std::int32_t or aim for space and use std::int64_t. The latter, of course, incurs a significant memory overhead.
â€“Â Matthieu M.
Apr 18 at 13:12

1

For those following at home; the assembly of the long long int and int versions => godbolt.org/g/pGpHFt . As expected (1) the initialization and final checks are optimized and (2) the int versions requires only half as many iterations.
â€“Â Matthieu M.
Apr 18 at 13:18

Note that 1 << CHAR_BIT is not guaranteed to be 256.
â€“Â wchargin
Apr 18 at 7:48

I would recommend using something else than long long int: either assert that the strings are short-ish (< 2GB each) and use std::int32_t or aim for space and use std::int64_t. The latter, of course, incurs a significant memory overhead.
â€“Â Matthieu M.
Apr 18 at 13:12

For those following at home; the assembly of the long long int and int versions => godbolt.org/g/pGpHFt . As expected (1) the initialization and final checks are optimized and (2) the int versions requires only half as many iterations.
â€“Â Matthieu M.
Apr 18 at 13:18

add a commentÂ |Â

up vote
6
down vote

You have got the basics.

Though @Jerry Coffin points out some good points.

So I would change to use an array of integers. One integer for each character. Though lets use a vector do easier comparisons.

// Stealing @Jerry Coffin code
bool is_perm2(std::string const &a, std::string const &b) 

 if (a.length() != b.length())
 return false;

 std::array<int, 1 + UCHAR_MAX> s1;
 std::array<int, 1 + UCHAR_MAX> s2;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

Notes:

Complexity O(n).

Overhead per iteration very small.

Overhead high for initialization.

edited Apr 18 at 16:58

answered Apr 17 at 23:23

Martin York

70.8k481244

2

What's the significance of casting unsigned char to unsigned char? Isn't that just an identity? For real portability, I'd write 1 + UCHAR_MAX rather than 256,
â€“Â Toby Speight
Apr 18 at 9:27

Shouldn't that be char or preferable auto for c?
â€“Â Deduplicator
Apr 18 at 16:21

add a commentÂ |Â

up vote
6
down vote

You have got the basics.

Though @Jerry Coffin points out some good points.

So I would change to use an array of integers. One integer for each character. Though lets use a vector do easier comparisons.

// Stealing @Jerry Coffin code
bool is_perm2(std::string const &a, std::string const &b) 

 if (a.length() != b.length())
 return false;

 std::array<int, 1 + UCHAR_MAX> s1;
 std::array<int, 1 + UCHAR_MAX> s2;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

Notes:

Complexity O(n).

Overhead per iteration very small.

Overhead high for initialization.

edited Apr 18 at 16:58

answered Apr 17 at 23:23

Martin York

70.8k481244

2

What's the significance of casting unsigned char to unsigned char? Isn't that just an identity? For real portability, I'd write 1 + UCHAR_MAX rather than 256,
â€“Â Toby Speight
Apr 18 at 9:27

Shouldn't that be char or preferable auto for c?
â€“Â Deduplicator
Apr 18 at 16:21

add a commentÂ |Â

up vote
6
down vote

You have got the basics.

Though @Jerry Coffin points out some good points.

So I would change to use an array of integers. One integer for each character. Though lets use a vector do easier comparisons.

// Stealing @Jerry Coffin code
bool is_perm2(std::string const &a, std::string const &b) 

 if (a.length() != b.length())
 return false;

 std::array<int, 1 + UCHAR_MAX> s1;
 std::array<int, 1 + UCHAR_MAX> s2;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

Notes:

Complexity O(n).

Overhead per iteration very small.

Overhead high for initialization.

edited Apr 18 at 16:58

answered Apr 17 at 23:23

Martin York

70.8k481244

You have got the basics.

Though @Jerry Coffin points out some good points.

So I would change to use an array of integers. One integer for each character. Though lets use a vector do easier comparisons.

// Stealing @Jerry Coffin code
bool is_perm2(std::string const &a, std::string const &b) 

 if (a.length() != b.length())
 return false;

 std::array<int, 1 + UCHAR_MAX> s1;
 std::array<int, 1 + UCHAR_MAX> s2;

 for (unsigned char c : a)
 ++s1[c];
 for (unsigned char c : b)
 ++s2[c];
 return s1 == s2;

Notes:

Complexity O(n).

Overhead per iteration very small.

Overhead high for initialization.

edited Apr 18 at 16:58

answered Apr 17 at 23:23

Martin York

70.8k481244

edited Apr 18 at 16:58

answered Apr 17 at 23:23

Martin York

70.8k481244

answered Apr 17 at 23:23

Martin York

70.8k481244

answered Apr 17 at 23:23

Martin York

70.8k481244

2

What's the significance of casting unsigned char to unsigned char? Isn't that just an identity? For real portability, I'd write 1 + UCHAR_MAX rather than 256,
â€“Â Toby Speight
Apr 18 at 9:27

Shouldn't that be char or preferable auto for c?
â€“Â Deduplicator
Apr 18 at 16:21

add a commentÂ |Â

2

What's the significance of casting unsigned char to unsigned char? Isn't that just an identity? For real portability, I'd write 1 + UCHAR_MAX rather than 256,
â€“Â Toby Speight
Apr 18 at 9:27

Shouldn't that be char or preferable auto for c?
â€“Â Deduplicator
Apr 18 at 16:21

What's the significance of casting unsigned char to unsigned char? Isn't that just an identity? For real portability, I'd write 1 + UCHAR_MAX rather than 256,
â€“Â Toby Speight
Apr 18 at 9:27

Shouldn't that be char or preferable auto for c?
â€“Â Deduplicator
Apr 18 at 16:21

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

trjhtr