Safe, simultaneous string replacement
Clash Royale CLAN TAG#URR8PPP
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;
up vote
2
down vote
favorite
I've written a function which takes as an input:
- a string to be modified
- a vector of regular expressions on which to match
- a vector of replacements
- option to ignore case in the match statements
The function will then, from the users perspective, simultaneously perform all replacements in a 'safe' fashion. By 'safe' I mean sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are patterns to match, ignore the sub-string 'the' in 'they') and also allow users to have cross-replacement (swap 'hey' for 'ho' and 'ho' for 'hey').
Examples:
rcpp_mgsub("hey ho","hey","ho","ho","hey",true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true)
// Expect "how, are you hey?"
I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.
So, I'm curious what inefficiencies are present in the rcpp_mgsub
function that I should focus on improving.
#include <iostream>
#include <string>
#include <regex>
std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic)
std::string newString = "";
std::smatch matches;
std::string prefix = string;
std::string detected;
std::string suffix;
std::regex::flag_type flags;
flags
int main()
std::cout << rcpp_mgsub("hey ho hey","hey","ho","ho","hey",true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.","[Dd]opa(.*?mine)","fake","Meta\1","real",false);
return(0);
c++ performance c++11
add a comment |Â
up vote
2
down vote
favorite
I've written a function which takes as an input:
- a string to be modified
- a vector of regular expressions on which to match
- a vector of replacements
- option to ignore case in the match statements
The function will then, from the users perspective, simultaneously perform all replacements in a 'safe' fashion. By 'safe' I mean sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are patterns to match, ignore the sub-string 'the' in 'they') and also allow users to have cross-replacement (swap 'hey' for 'ho' and 'ho' for 'hey').
Examples:
rcpp_mgsub("hey ho","hey","ho","ho","hey",true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true)
// Expect "how, are you hey?"
I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.
So, I'm curious what inefficiencies are present in the rcpp_mgsub
function that I should focus on improving.
#include <iostream>
#include <string>
#include <regex>
std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic)
std::string newString = "";
std::smatch matches;
std::string prefix = string;
std::string detected;
std::string suffix;
std::regex::flag_type flags;
flags
int main()
std::cout << rcpp_mgsub("hey ho hey","hey","ho","ho","hey",true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.","[Dd]opa(.*?mine)","fake","Meta\1","real",false);
return(0);
c++ performance c++11
Are you compiling with optimizations enabled? With GCC and amain()
which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between-O0
and-O3
. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
â Toby Speight
Jan 19 at 11:50
add a comment |Â
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I've written a function which takes as an input:
- a string to be modified
- a vector of regular expressions on which to match
- a vector of replacements
- option to ignore case in the match statements
The function will then, from the users perspective, simultaneously perform all replacements in a 'safe' fashion. By 'safe' I mean sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are patterns to match, ignore the sub-string 'the' in 'they') and also allow users to have cross-replacement (swap 'hey' for 'ho' and 'ho' for 'hey').
Examples:
rcpp_mgsub("hey ho","hey","ho","ho","hey",true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true)
// Expect "how, are you hey?"
I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.
So, I'm curious what inefficiencies are present in the rcpp_mgsub
function that I should focus on improving.
#include <iostream>
#include <string>
#include <regex>
std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic)
std::string newString = "";
std::smatch matches;
std::string prefix = string;
std::string detected;
std::string suffix;
std::regex::flag_type flags;
flags
int main()
std::cout << rcpp_mgsub("hey ho hey","hey","ho","ho","hey",true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.","[Dd]opa(.*?mine)","fake","Meta\1","real",false);
return(0);
c++ performance c++11
I've written a function which takes as an input:
- a string to be modified
- a vector of regular expressions on which to match
- a vector of replacements
- option to ignore case in the match statements
The function will then, from the users perspective, simultaneously perform all replacements in a 'safe' fashion. By 'safe' I mean sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are patterns to match, ignore the sub-string 'the' in 'they') and also allow users to have cross-replacement (swap 'hey' for 'ho' and 'ho' for 'hey').
Examples:
rcpp_mgsub("hey ho","hey","ho","ho","hey",true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true)
// Expect "how, are you hey?"
I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.
So, I'm curious what inefficiencies are present in the rcpp_mgsub
function that I should focus on improving.
#include <iostream>
#include <string>
#include <regex>
std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic)
std::string newString = "";
std::smatch matches;
std::string prefix = string;
std::string detected;
std::string suffix;
std::regex::flag_type flags;
flags
int main()
std::cout << rcpp_mgsub("hey ho hey","hey","ho","ho","hey",true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.","[Dd]opa(.*?mine)","fake","Meta\1","real",false);
return(0);
c++ performance c++11
edited Jan 22 at 14:23
Jamalâ¦
30.1k11114225
30.1k11114225
asked Jan 17 at 14:22
Mark
1113
1113
Are you compiling with optimizations enabled? With GCC and amain()
which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between-O0
and-O3
. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
â Toby Speight
Jan 19 at 11:50
add a comment |Â
Are you compiling with optimizations enabled? With GCC and amain()
which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between-O0
and-O3
. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
â Toby Speight
Jan 19 at 11:50
Are you compiling with optimizations enabled? With GCC and a
main()
which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0
and -O3
. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.â Toby Speight
Jan 19 at 11:50
Are you compiling with optimizations enabled? With GCC and a
main()
which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0
and -O3
. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.â Toby Speight
Jan 19 at 11:50
add a comment |Â
2 Answers
2
active
oldest
votes
up vote
2
down vote
What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.
Do you really need backreferences? That usually makes regex much slower.
It's very unusual to use a variable with the same name as the variable's type.
You don't need to initialize prefix=string
before the loop, it's done first thing inside the loop.
Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.
It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
â Mark
Jan 17 at 14:36
By backreference do you mean the using()
and\1
in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I addedstd::regex_constants::format_sed
to support using backreferences the same way R does it.
â Mark
Jan 17 at 14:49
So you're on Windows using a recent mingw version of GCC then, I take it.
â Snowbody
Jan 17 at 14:50
Yes. on Windows. Sorry for missing the most obvious part of that response.
â Mark
Jan 17 at 14:50
For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/â¦
â Snowbody
Jan 18 at 13:39
add a comment |Â
up vote
2
down vote
Headers
You forgot to include <vector>
.
Only the test program requires <iostream>
, so consider moving it to immediately before main()
, to make it easier to separate the implementation and its tests when the time comes.
Interface
You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:
- Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.
- The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a
std::regex_constants::syntax_option_type
to be used; this would also allow the caller to choose different regex grammars.
I think I would write the interface something like this:
std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);
// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););
return rcpp_mgsub(string, replacements);
Algorithm
After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.
I have the beginnings of an implementation that does this, but it fails at the ++it;
line. I'll post it here while I continue debugging:
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>
std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)
static const std::sregex_iterator no_match = ;
using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;
std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);
std::string result = ;
auto position = s.begin();
while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)
// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;
// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;
Compilation
The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main()
to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):
#include <iostream>
int main()
std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";
std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);
std::cout << s << std::endl ;
I found a large difference between g++ -O0
and g++ -O3
on this code (roughly a factor of 10ÃÂ). Quite a large part of this program comes from expanding templates from the <regex>
header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.
Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.
wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
â Mark
Jan 25 at 19:48
As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
â Toby Speight
Jan 26 at 11:11
Very helpful! Thank you!
â Mark
Jan 26 at 11:14
add a comment |Â
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.
Do you really need backreferences? That usually makes regex much slower.
It's very unusual to use a variable with the same name as the variable's type.
You don't need to initialize prefix=string
before the loop, it's done first thing inside the loop.
Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.
It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
â Mark
Jan 17 at 14:36
By backreference do you mean the using()
and\1
in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I addedstd::regex_constants::format_sed
to support using backreferences the same way R does it.
â Mark
Jan 17 at 14:49
So you're on Windows using a recent mingw version of GCC then, I take it.
â Snowbody
Jan 17 at 14:50
Yes. on Windows. Sorry for missing the most obvious part of that response.
â Mark
Jan 17 at 14:50
For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/â¦
â Snowbody
Jan 18 at 13:39
add a comment |Â
up vote
2
down vote
What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.
Do you really need backreferences? That usually makes regex much slower.
It's very unusual to use a variable with the same name as the variable's type.
You don't need to initialize prefix=string
before the loop, it's done first thing inside the loop.
Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.
It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
â Mark
Jan 17 at 14:36
By backreference do you mean the using()
and\1
in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I addedstd::regex_constants::format_sed
to support using backreferences the same way R does it.
â Mark
Jan 17 at 14:49
So you're on Windows using a recent mingw version of GCC then, I take it.
â Snowbody
Jan 17 at 14:50
Yes. on Windows. Sorry for missing the most obvious part of that response.
â Mark
Jan 17 at 14:50
For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/â¦
â Snowbody
Jan 18 at 13:39
add a comment |Â
up vote
2
down vote
up vote
2
down vote
What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.
Do you really need backreferences? That usually makes regex much slower.
It's very unusual to use a variable with the same name as the variable's type.
You don't need to initialize prefix=string
before the loop, it's done first thing inside the loop.
Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.
What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.
Do you really need backreferences? That usually makes regex much slower.
It's very unusual to use a variable with the same name as the variable's type.
You don't need to initialize prefix=string
before the loop, it's done first thing inside the loop.
Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.
answered Jan 17 at 14:34
Snowbody
7,7371343
7,7371343
It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
â Mark
Jan 17 at 14:36
By backreference do you mean the using()
and\1
in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I addedstd::regex_constants::format_sed
to support using backreferences the same way R does it.
â Mark
Jan 17 at 14:49
So you're on Windows using a recent mingw version of GCC then, I take it.
â Snowbody
Jan 17 at 14:50
Yes. on Windows. Sorry for missing the most obvious part of that response.
â Mark
Jan 17 at 14:50
For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/â¦
â Snowbody
Jan 18 at 13:39
add a comment |Â
It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
â Mark
Jan 17 at 14:36
By backreference do you mean the using()
and\1
in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I addedstd::regex_constants::format_sed
to support using backreferences the same way R does it.
â Mark
Jan 17 at 14:49
So you're on Windows using a recent mingw version of GCC then, I take it.
â Snowbody
Jan 17 at 14:50
Yes. on Windows. Sorry for missing the most obvious part of that response.
â Mark
Jan 17 at 14:50
For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/â¦
â Snowbody
Jan 18 at 13:39
It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
â Mark
Jan 17 at 14:36
It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
â Mark
Jan 17 at 14:36
By backreference do you mean the using
()
and \1
in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed
to support using backreferences the same way R does it.â Mark
Jan 17 at 14:49
By backreference do you mean the using
()
and \1
in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed
to support using backreferences the same way R does it.â Mark
Jan 17 at 14:49
So you're on Windows using a recent mingw version of GCC then, I take it.
â Snowbody
Jan 17 at 14:50
So you're on Windows using a recent mingw version of GCC then, I take it.
â Snowbody
Jan 17 at 14:50
Yes. on Windows. Sorry for missing the most obvious part of that response.
â Mark
Jan 17 at 14:50
Yes. on Windows. Sorry for missing the most obvious part of that response.
â Mark
Jan 17 at 14:50
For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/â¦
â Snowbody
Jan 18 at 13:39
For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/â¦
â Snowbody
Jan 18 at 13:39
add a comment |Â
up vote
2
down vote
Headers
You forgot to include <vector>
.
Only the test program requires <iostream>
, so consider moving it to immediately before main()
, to make it easier to separate the implementation and its tests when the time comes.
Interface
You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:
- Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.
- The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a
std::regex_constants::syntax_option_type
to be used; this would also allow the caller to choose different regex grammars.
I think I would write the interface something like this:
std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);
// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););
return rcpp_mgsub(string, replacements);
Algorithm
After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.
I have the beginnings of an implementation that does this, but it fails at the ++it;
line. I'll post it here while I continue debugging:
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>
std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)
static const std::sregex_iterator no_match = ;
using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;
std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);
std::string result = ;
auto position = s.begin();
while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)
// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;
// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;
Compilation
The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main()
to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):
#include <iostream>
int main()
std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";
std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);
std::cout << s << std::endl ;
I found a large difference between g++ -O0
and g++ -O3
on this code (roughly a factor of 10ÃÂ). Quite a large part of this program comes from expanding templates from the <regex>
header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.
Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.
wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
â Mark
Jan 25 at 19:48
As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
â Toby Speight
Jan 26 at 11:11
Very helpful! Thank you!
â Mark
Jan 26 at 11:14
add a comment |Â
up vote
2
down vote
Headers
You forgot to include <vector>
.
Only the test program requires <iostream>
, so consider moving it to immediately before main()
, to make it easier to separate the implementation and its tests when the time comes.
Interface
You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:
- Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.
- The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a
std::regex_constants::syntax_option_type
to be used; this would also allow the caller to choose different regex grammars.
I think I would write the interface something like this:
std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);
// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););
return rcpp_mgsub(string, replacements);
Algorithm
After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.
I have the beginnings of an implementation that does this, but it fails at the ++it;
line. I'll post it here while I continue debugging:
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>
std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)
static const std::sregex_iterator no_match = ;
using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;
std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);
std::string result = ;
auto position = s.begin();
while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)
// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;
// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;
Compilation
The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main()
to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):
#include <iostream>
int main()
std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";
std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);
std::cout << s << std::endl ;
I found a large difference between g++ -O0
and g++ -O3
on this code (roughly a factor of 10ÃÂ). Quite a large part of this program comes from expanding templates from the <regex>
header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.
Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.
wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
â Mark
Jan 25 at 19:48
As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
â Toby Speight
Jan 26 at 11:11
Very helpful! Thank you!
â Mark
Jan 26 at 11:14
add a comment |Â
up vote
2
down vote
up vote
2
down vote
Headers
You forgot to include <vector>
.
Only the test program requires <iostream>
, so consider moving it to immediately before main()
, to make it easier to separate the implementation and its tests when the time comes.
Interface
You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:
- Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.
- The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a
std::regex_constants::syntax_option_type
to be used; this would also allow the caller to choose different regex grammars.
I think I would write the interface something like this:
std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);
// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););
return rcpp_mgsub(string, replacements);
Algorithm
After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.
I have the beginnings of an implementation that does this, but it fails at the ++it;
line. I'll post it here while I continue debugging:
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>
std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)
static const std::sregex_iterator no_match = ;
using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;
std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);
std::string result = ;
auto position = s.begin();
while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)
// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;
// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;
Compilation
The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main()
to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):
#include <iostream>
int main()
std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";
std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);
std::cout << s << std::endl ;
I found a large difference between g++ -O0
and g++ -O3
on this code (roughly a factor of 10ÃÂ). Quite a large part of this program comes from expanding templates from the <regex>
header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.
Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.
Headers
You forgot to include <vector>
.
Only the test program requires <iostream>
, so consider moving it to immediately before main()
, to make it easier to separate the implementation and its tests when the time comes.
Interface
You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:
- Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.
- The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a
std::regex_constants::syntax_option_type
to be used; this would also allow the caller to choose different regex grammars.
I think I would write the interface something like this:
std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);
// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););
return rcpp_mgsub(string, replacements);
Algorithm
After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.
I have the beginnings of an implementation that does this, but it fails at the ++it;
line. I'll post it here while I continue debugging:
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>
std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)
static const std::sregex_iterator no_match = ;
using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;
std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);
std::string result = ;
auto position = s.begin();
while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)
// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;
// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;
Compilation
The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main()
to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):
#include <iostream>
int main()
std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";
std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);
std::cout << s << std::endl ;
I found a large difference between g++ -O0
and g++ -O3
on this code (roughly a factor of 10ÃÂ). Quite a large part of this program comes from expanding templates from the <regex>
header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.
Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.
edited Jan 30 at 10:04
answered Jan 25 at 18:44
Toby Speight
17.8k13491
17.8k13491
wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
â Mark
Jan 25 at 19:48
As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
â Toby Speight
Jan 26 at 11:11
Very helpful! Thank you!
â Mark
Jan 26 at 11:14
add a comment |Â
wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
â Mark
Jan 25 at 19:48
As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
â Toby Speight
Jan 26 at 11:11
Very helpful! Thank you!
â Mark
Jan 26 at 11:14
wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
â Mark
Jan 25 at 19:48
wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
â Mark
Jan 25 at 19:48
As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
â Toby Speight
Jan 26 at 11:11
As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
â Toby Speight
Jan 26 at 11:11
Very helpful! Thank you!
â Mark
Jan 26 at 11:14
Very helpful! Thank you!
â Mark
Jan 26 at 11:14
add a comment |Â
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f185312%2fsafe-simultaneous-string-replacement%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Are you compiling with optimizations enabled? With GCC and a
main()
which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between-O0
and-O3
. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.â Toby Speight
Jan 19 at 11:50