Safe, simultaneous string replacement

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
2
down vote

favorite












I've written a function which takes as an input:



  • a string to be modified

  • a vector of regular expressions on which to match

  • a vector of replacements

  • option to ignore case in the match statements

The function will then, from the users perspective, simultaneously perform all replacements in a 'safe' fashion. By 'safe' I mean sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are patterns to match, ignore the sub-string 'the' in 'they') and also allow users to have cross-replacement (swap 'hey' for 'ho' and 'ho' for 'hey').



Examples:



rcpp_mgsub("hey ho","hey","ho","ho","hey",true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true)
// Expect "how, are you hey?"


I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.



So, I'm curious what inefficiencies are present in the rcpp_mgsub function that I should focus on improving.



#include <iostream>
#include <string>
#include <regex>

std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic)
std::string newString = "";
std::smatch matches;
std::string prefix = string;
std::string detected;
std::string suffix;

std::regex::flag_type flags;
flags


int main()

std::cout << rcpp_mgsub("hey ho hey","hey","ho","ho","hey",true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.","[Dd]opa(.*?mine)","fake","Meta\1","real",false);
return(0);







share|improve this question





















  • Are you compiling with optimizations enabled? With GCC and a main() which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0 and -O3. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
    – Toby Speight
    Jan 19 at 11:50

















up vote
2
down vote

favorite












I've written a function which takes as an input:



  • a string to be modified

  • a vector of regular expressions on which to match

  • a vector of replacements

  • option to ignore case in the match statements

The function will then, from the users perspective, simultaneously perform all replacements in a 'safe' fashion. By 'safe' I mean sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are patterns to match, ignore the sub-string 'the' in 'they') and also allow users to have cross-replacement (swap 'hey' for 'ho' and 'ho' for 'hey').



Examples:



rcpp_mgsub("hey ho","hey","ho","ho","hey",true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true)
// Expect "how, are you hey?"


I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.



So, I'm curious what inefficiencies are present in the rcpp_mgsub function that I should focus on improving.



#include <iostream>
#include <string>
#include <regex>

std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic)
std::string newString = "";
std::smatch matches;
std::string prefix = string;
std::string detected;
std::string suffix;

std::regex::flag_type flags;
flags


int main()

std::cout << rcpp_mgsub("hey ho hey","hey","ho","ho","hey",true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.","[Dd]opa(.*?mine)","fake","Meta\1","real",false);
return(0);







share|improve this question





















  • Are you compiling with optimizations enabled? With GCC and a main() which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0 and -O3. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
    – Toby Speight
    Jan 19 at 11:50













up vote
2
down vote

favorite









up vote
2
down vote

favorite











I've written a function which takes as an input:



  • a string to be modified

  • a vector of regular expressions on which to match

  • a vector of replacements

  • option to ignore case in the match statements

The function will then, from the users perspective, simultaneously perform all replacements in a 'safe' fashion. By 'safe' I mean sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are patterns to match, ignore the sub-string 'the' in 'they') and also allow users to have cross-replacement (swap 'hey' for 'ho' and 'ho' for 'hey').



Examples:



rcpp_mgsub("hey ho","hey","ho","ho","hey",true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true)
// Expect "how, are you hey?"


I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.



So, I'm curious what inefficiencies are present in the rcpp_mgsub function that I should focus on improving.



#include <iostream>
#include <string>
#include <regex>

std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic)
std::string newString = "";
std::smatch matches;
std::string prefix = string;
std::string detected;
std::string suffix;

std::regex::flag_type flags;
flags


int main()

std::cout << rcpp_mgsub("hey ho hey","hey","ho","ho","hey",true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.","[Dd]opa(.*?mine)","fake","Meta\1","real",false);
return(0);







share|improve this question













I've written a function which takes as an input:



  • a string to be modified

  • a vector of regular expressions on which to match

  • a vector of replacements

  • option to ignore case in the match statements

The function will then, from the users perspective, simultaneously perform all replacements in a 'safe' fashion. By 'safe' I mean sub-string matches are ignored in favor of longer matches (if 'the' and 'they' are patterns to match, ignore the sub-string 'the' in 'they') and also allow users to have cross-replacement (swap 'hey' for 'ho' and 'ho' for 'hey').



Examples:



rcpp_mgsub("hey ho","hey","ho","ho","hey",true)
// Expect "ho hey"
rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true)
// Expect "how, are you hey?"


I've written a function in the R programming language that accomplishes this, but R isn't a fast language. R libraries can include pre-compiled C++11 code which can result in vastly faster functions. To this end, I've written a C++11 function but the performance is actually significantly slower than R (3-7x slower in various comparisons I've done) so I realize there must be some significant inefficiencies in my function.



So, I'm curious what inefficiencies are present in the rcpp_mgsub function that I should focus on improving.



#include <iostream>
#include <string>
#include <regex>

std::string rcpp_mgsub(std::string string, std::vector<std::string> const& match, std::vector<std::string> const& replace, bool const& ic)
std::string newString = "";
std::smatch matches;
std::string prefix = string;
std::string detected;
std::string suffix;

std::regex::flag_type flags;
flags


int main()

std::cout << rcpp_mgsub("hey ho hey","hey","ho","ho","hey",true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?","hey","how","are","you","how","are","you","hey",true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.","[Dd]opa(.*?mine)","fake","Meta\1","real",false);
return(0);









share|improve this question












share|improve this question




share|improve this question








edited Jan 22 at 14:23









Jamal♦

30.1k11114225




30.1k11114225









asked Jan 17 at 14:22









Mark

1113




1113











  • Are you compiling with optimizations enabled? With GCC and a main() which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0 and -O3. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
    – Toby Speight
    Jan 19 at 11:50

















  • Are you compiling with optimizations enabled? With GCC and a main() which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0 and -O3. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
    – Toby Speight
    Jan 19 at 11:50
















Are you compiling with optimizations enabled? With GCC and a main() which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0 and -O3. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
– Toby Speight
Jan 19 at 11:50





Are you compiling with optimizations enabled? With GCC and a main() which performs the last substitution and its reverse 20000 times, I get a factor of 10 difference between -O0 and -O3. I suspect that there's a fair bit of template code that's subject to your code generation options, plus a lot that can be done by the link-time optimizer. I've got some other ideas, but it might take until next week before I get time to try them and write a review.
– Toby Speight
Jan 19 at 11:50











2 Answers
2






active

oldest

votes

















up vote
2
down vote













What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.



Do you really need backreferences? That usually makes regex much slower.



It's very unusual to use a variable with the same name as the variable's type.



You don't need to initialize prefix=string before the loop, it's done first thing inside the loop.



Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.






share|improve this answer





















  • It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
    – Mark
    Jan 17 at 14:36











  • By backreference do you mean the using () and \1 in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed to support using backreferences the same way R does it.
    – Mark
    Jan 17 at 14:49










  • So you're on Windows using a recent mingw version of GCC then, I take it.
    – Snowbody
    Jan 17 at 14:50










  • Yes. on Windows. Sorry for missing the most obvious part of that response.
    – Mark
    Jan 17 at 14:50










  • For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/…
    – Snowbody
    Jan 18 at 13:39

















up vote
2
down vote













Headers



You forgot to include <vector>.



Only the test program requires <iostream>, so consider moving it to immediately before main(), to make it easier to separate the implementation and its tests when the time comes.



Interface



You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:



  • Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.

  • The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a std::regex_constants::syntax_option_type to be used; this would also allow the caller to choose different regex grammars.

I think I would write the interface something like this:



std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);




// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););

return rcpp_mgsub(string, replacements);



Algorithm



After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.



I have the beginnings of an implementation that does this, but it fails at the ++it; line. I'll post it here while I continue debugging:



#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)

static const std::sregex_iterator no_match = ;

using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;

std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);

std::string result = ;
auto position = s.begin();

while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)

// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;


// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;




Compilation



The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main() to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):



#include <iostream>

int main()

std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";

std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);

std::cout << s << std::endl ;



I found a large difference between g++ -O0 and g++ -O3 on this code (roughly a factor of 10×). Quite a large part of this program comes from expanding templates from the <regex> header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.



Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.






share|improve this answer























  • wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
    – Mark
    Jan 25 at 19:48











  • As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
    – Toby Speight
    Jan 26 at 11:11










  • Very helpful! Thank you!
    – Mark
    Jan 26 at 11:14










Your Answer




StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
);
);
, "mathjax-editing");

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);








 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f185312%2fsafe-simultaneous-string-replacement%23new-answer', 'question_page');

);

Post as a guest






























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
2
down vote













What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.



Do you really need backreferences? That usually makes regex much slower.



It's very unusual to use a variable with the same name as the variable's type.



You don't need to initialize prefix=string before the loop, it's done first thing inside the loop.



Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.






share|improve this answer





















  • It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
    – Mark
    Jan 17 at 14:36











  • By backreference do you mean the using () and \1 in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed to support using backreferences the same way R does it.
    – Mark
    Jan 17 at 14:49










  • So you're on Windows using a recent mingw version of GCC then, I take it.
    – Snowbody
    Jan 17 at 14:50










  • Yes. on Windows. Sorry for missing the most obvious part of that response.
    – Mark
    Jan 17 at 14:50










  • For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/…
    – Snowbody
    Jan 18 at 13:39














up vote
2
down vote













What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.



Do you really need backreferences? That usually makes regex much slower.



It's very unusual to use a variable with the same name as the variable's type.



You don't need to initialize prefix=string before the loop, it's done first thing inside the loop.



Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.






share|improve this answer





















  • It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
    – Mark
    Jan 17 at 14:36











  • By backreference do you mean the using () and \1 in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed to support using backreferences the same way R does it.
    – Mark
    Jan 17 at 14:49










  • So you're on Windows using a recent mingw version of GCC then, I take it.
    – Snowbody
    Jan 17 at 14:50










  • Yes. on Windows. Sorry for missing the most obvious part of that response.
    – Mark
    Jan 17 at 14:50










  • For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/…
    – Snowbody
    Jan 18 at 13:39












up vote
2
down vote










up vote
2
down vote









What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.



Do you really need backreferences? That usually makes regex much slower.



It's very unusual to use a variable with the same name as the variable's type.



You don't need to initialize prefix=string before the loop, it's done first thing inside the loop.



Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.






share|improve this answer













What compiler are you using? Some implementations of the C++ standard library are better than others for supporting Regex.



Do you really need backreferences? That usually makes regex much slower.



It's very unusual to use a variable with the same name as the variable's type.



You don't need to initialize prefix=string before the loop, it's done first thing inside the loop.



Have you done any profiling of your code to see which lines are slowing it down? It'd be very good to know if the slowness is due to the regex or due to the string concatenation.







share|improve this answer













share|improve this answer



share|improve this answer











answered Jan 17 at 14:34









Snowbody

7,7371343




7,7371343











  • It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
    – Mark
    Jan 17 at 14:36











  • By backreference do you mean the using () and \1 in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed to support using backreferences the same way R does it.
    – Mark
    Jan 17 at 14:49










  • So you're on Windows using a recent mingw version of GCC then, I take it.
    – Snowbody
    Jan 17 at 14:50










  • Yes. on Windows. Sorry for missing the most obvious part of that response.
    – Mark
    Jan 17 at 14:50










  • For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/…
    – Snowbody
    Jan 18 at 13:39
















  • It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
    – Mark
    Jan 17 at 14:36











  • By backreference do you mean the using () and \1 in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed to support using backreferences the same way R does it.
    – Mark
    Jan 17 at 14:49










  • So you're on Windows using a recent mingw version of GCC then, I take it.
    – Snowbody
    Jan 17 at 14:50










  • Yes. on Windows. Sorry for missing the most obvious part of that response.
    – Mark
    Jan 17 at 14:50










  • For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/…
    – Snowbody
    Jan 18 at 13:39















It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
– Mark
Jan 17 at 14:36





It's a mingw_64 compiler - I don't know the exact version (compiling for use in R requires the current 'standard' set by R). Sorry for my ignorance - what is a backreference? Where am I using them here? Can you recommend a tool for profiling my C++ code?
– Mark
Jan 17 at 14:36













By backreference do you mean the using () and \1 in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed to support using backreferences the same way R does it.
– Mark
Jan 17 at 14:49




By backreference do you mean the using () and \1 in the regex? I'm trying to write a general function that accepts a wide variety of inputs so I would like for them to continue to be supported. I added std::regex_constants::format_sed to support using backreferences the same way R does it.
– Mark
Jan 17 at 14:49












So you're on Windows using a recent mingw version of GCC then, I take it.
– Snowbody
Jan 17 at 14:50




So you're on Windows using a recent mingw version of GCC then, I take it.
– Snowbody
Jan 17 at 14:50












Yes. on Windows. Sorry for missing the most obvious part of that response.
– Mark
Jan 17 at 14:50




Yes. on Windows. Sorry for missing the most obvious part of that response.
– Mark
Jan 17 at 14:50












For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/…
– Snowbody
Jan 18 at 13:39




For profiling MinGW code on Windows, see jonforums.github.io/cochise/2011/07/18/…
– Snowbody
Jan 18 at 13:39












up vote
2
down vote













Headers



You forgot to include <vector>.



Only the test program requires <iostream>, so consider moving it to immediately before main(), to make it easier to separate the implementation and its tests when the time comes.



Interface



You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:



  • Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.

  • The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a std::regex_constants::syntax_option_type to be used; this would also allow the caller to choose different regex grammars.

I think I would write the interface something like this:



std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);




// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););

return rcpp_mgsub(string, replacements);



Algorithm



After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.



I have the beginnings of an implementation that does this, but it fails at the ++it; line. I'll post it here while I continue debugging:



#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)

static const std::sregex_iterator no_match = ;

using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;

std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);

std::string result = ;
auto position = s.begin();

while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)

// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;


// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;




Compilation



The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main() to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):



#include <iostream>

int main()

std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";

std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);

std::cout << s << std::endl ;



I found a large difference between g++ -O0 and g++ -O3 on this code (roughly a factor of 10×). Quite a large part of this program comes from expanding templates from the <regex> header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.



Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.






share|improve this answer























  • wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
    – Mark
    Jan 25 at 19:48











  • As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
    – Toby Speight
    Jan 26 at 11:11










  • Very helpful! Thank you!
    – Mark
    Jan 26 at 11:14














up vote
2
down vote













Headers



You forgot to include <vector>.



Only the test program requires <iostream>, so consider moving it to immediately before main(), to make it easier to separate the implementation and its tests when the time comes.



Interface



You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:



  • Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.

  • The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a std::regex_constants::syntax_option_type to be used; this would also allow the caller to choose different regex grammars.

I think I would write the interface something like this:



std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);




// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););

return rcpp_mgsub(string, replacements);



Algorithm



After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.



I have the beginnings of an implementation that does this, but it fails at the ++it; line. I'll post it here while I continue debugging:



#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)

static const std::sregex_iterator no_match = ;

using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;

std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);

std::string result = ;
auto position = s.begin();

while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)

// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;


// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;




Compilation



The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main() to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):



#include <iostream>

int main()

std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";

std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);

std::cout << s << std::endl ;



I found a large difference between g++ -O0 and g++ -O3 on this code (roughly a factor of 10×). Quite a large part of this program comes from expanding templates from the <regex> header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.



Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.






share|improve this answer























  • wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
    – Mark
    Jan 25 at 19:48











  • As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
    – Toby Speight
    Jan 26 at 11:11










  • Very helpful! Thank you!
    – Mark
    Jan 26 at 11:14












up vote
2
down vote










up vote
2
down vote









Headers



You forgot to include <vector>.



Only the test program requires <iostream>, so consider moving it to immediately before main(), to make it easier to separate the implementation and its tests when the time comes.



Interface



You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:



  • Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.

  • The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a std::regex_constants::syntax_option_type to be used; this would also allow the caller to choose different regex grammars.

I think I would write the interface something like this:



std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);




// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););

return rcpp_mgsub(string, replacements);



Algorithm



After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.



I have the beginnings of an implementation that does this, but it fails at the ++it; line. I'll post it here while I continue debugging:



#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)

static const std::sregex_iterator no_match = ;

using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;

std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);

std::string result = ;
auto position = s.begin();

while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)

// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;


// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;




Compilation



The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main() to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):



#include <iostream>

int main()

std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";

std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);

std::cout << s << std::endl ;



I found a large difference between g++ -O0 and g++ -O3 on this code (roughly a factor of 10×). Quite a large part of this program comes from expanding templates from the <regex> header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.



Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.






share|improve this answer















Headers



You forgot to include <vector>.



Only the test program requires <iostream>, so consider moving it to immediately before main(), to make it easier to separate the implementation and its tests when the time comes.



Interface



You might be constrained by what the calling environment expects (I don't know much about R), but there are a couple of surprises in the function signature:



  • Passing the match strings and their replacements as a pair of parallel containers can be hard to get right (and there doesn't seem to be even the minimum of checking that their lengths match). It's better to accept a list of pairs than a pair of lists; that way, each match appears alongside its replacement.

  • The boolean flag is a danger sign - it's not obvious at the call site what the flag means. It might be better to accept a std::regex_constants::syntax_option_type to be used; this would also allow the caller to choose different regex grammars.

I think I would write the interface something like this:



std::string rcpp_mgsub(std::string string,
std::vector<std::pair<std::regex,std::string>> const& replacements);




// Compatibility layer, if required
std::string rcpp_mgsub(const std::string& string,
std::vector<std::string> const& match,
std::vector<std::string> const& replace,
bool const& ic)
= std::regex_constants::icase;
std::vector<std::pair<std::regex,std::string>> replacements;
replacements.reserve(match.size());
std::transform(match.begin(), match.end(), replace.begin(), std::back_inserter(replacements),
[&flags](const std::string& m, const std::string& r) return std::make_pair(std::regex(m, flags), r););

return rcpp_mgsub(string, replacements);



Algorithm



After each replacement, every regex is re-searched from the last match. If we remembered where each one matched, we'd only need to update the matches for any that matched before the text just substituted. This may save a great deal of processing, particularly for regexes that are unmatched and for long input strings.



I have the beginnings of an implementation that does this, but it fails at the ++it; line. I'll post it here while I continue debugging:



#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

std::string rcpp_mgsub(const std::string& s, std::vector<std::pair<std::regex,std::string>> const& replacements)

static const std::sregex_iterator no_match = ;

using IterAndReplacement = std::pair<std::sregex_iterator,const std::string>;

std::vector<IterAndReplacement> iterators;
iterators.reserve(replacements.size());
for (auto const r: replacements)
iterators.emplace_back(std::sregex_iterators.begin(), s.end(), r.first, r.second);

std::string result = ;
auto position = s.begin();

while (true)
// find the next match, ignoring any shorter overlapping matches
IterAndReplacement const *best_match = nullptr;
for (auto& i: iterators)

// if no regex matches, just copy the rest of string and finish
if (!best_match)
result.append(position, s.end());
return result;


// otherwise, replace the match and continue to the next one
auto const& best = (*best_match->first);
auto const m = best.format(best_match->second, std::regex_constants::format_sed);
result.append(position, best[0].first).append(m);
position = best[0].second;




Compilation



The question is tagged performance, but there's no indication of how you're conducting performance tests. I adapted main() to transform a string (using a replacement and its inverse) twenty thousand times, and to use the result (to avoid over-optimizing):



#include <iostream>

int main()

std::cout << rcpp_mgsub("Hey hey hey ho ho Ho",
"hey","ho" ,
"ho", "hey", true) << "n";
std::cout << rcpp_mgsub("Hey, how are you?",
"hey","how","are","you",
"how","are","you","hey", true) << "n";
std::cout << rcpp_mgsub("Dopazamine is not the same as Dopachloride and is still fake.",
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false) << "n";

std::string s = "Dopazamine is not the same as Dopachloride and is still fake.";
for (auto i = 0u; i < 10000; ++i)
s = rcpp_mgsub(s,
"[Dd]opa(.*?mine)", "fake",
"Meta\1", "real", false);
s = rcpp_mgsub(s,
"Meta(.*?mine)", "fake",
"Dopa\1", "real", false);

std::cout << s << std::endl ;



I found a large difference between g++ -O0 and g++ -O3 on this code (roughly a factor of 10×). Quite a large part of this program comes from expanding templates from the <regex> header (therefore compiled as part of this translation unit, with our compiler). And there's quite a lot that can be inlined or removed by a link-time optimizer.



Remember: when making performance-related changes to code, always measure before and after - and make sure that what you're measuring is representative! If you carefully measure and profile the unoptimized builds, you may find that you're sacrificing readability for no improvement on what the optimizing compiler produces.







share|improve this answer















share|improve this answer



share|improve this answer








edited Jan 30 at 10:04


























answered Jan 25 at 18:44









Toby Speight

17.8k13491




17.8k13491











  • wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
    – Mark
    Jan 25 at 19:48











  • As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
    – Toby Speight
    Jan 26 at 11:11










  • Very helpful! Thank you!
    – Mark
    Jan 26 at 11:14
















  • wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
    – Mark
    Jan 25 at 19:48











  • As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
    – Toby Speight
    Jan 26 at 11:11










  • Very helpful! Thank you!
    – Mark
    Jan 26 at 11:14















wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
– Mark
Jan 25 at 19:48





wow, this is great. Thanks! So, the implementation in R would have checking of equal length match/replacements in R before calling the C++ function (just how the implementation works). In fact, all my error checking would be handled in R and the C++ function would be an internal call. I'm not 100% clear on how R objects are converted to C++ objects when they're passed in, but vectors and scalars definitely work. Also, the R function would have documentation on what the boolean flag would actually do.
– Mark
Jan 25 at 19:48













As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
– Toby Speight
Jan 26 at 11:11




As I don't know R, and you didn't include the R interface, I've just reviewed this like any other C++ code; I hope that's still helpful.
– Toby Speight
Jan 26 at 11:11












Very helpful! Thank you!
– Mark
Jan 26 at 11:14




Very helpful! Thank you!
– Mark
Jan 26 at 11:14












 

draft saved


draft discarded


























 


draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f185312%2fsafe-simultaneous-string-replacement%23new-answer', 'question_page');

);

Post as a guest













































































Popular posts from this blog

Chat program with C++ and SFML

Function to Return a JSON Like Objects Using VBA Collections and Arrays

Will my employers contract hold up in court?