Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement replace on String for multiple patterns #40484

Merged
merged 3 commits into from
Jun 7, 2021

Conversation

vtjnash
Copy link
Sponsor Member

@vtjnash vtjnash commented Apr 14, 2021

This has been attempted before, sometimes fairly similar to this, but the attempts seemed to be either too simple or much too complicated. This aims to be simple, and still seems to beat most of the "handwritten" benchmark cases.

The core of this works by partially unrolling the pattern tests using map on tuple. This seems to outperform nearly all of the performance benchmarks I could find (particularly those of @stevengj that he'd written of past attempts at this), including most of the hand-written ones and especially any that involved using a regex (even a fairly optimally-designed regex). There was one that this was a factor-of-two away, but it is mostly due to the unreliable findnext API design and apparently optimizer deficiencies with unravelling the resulting morass of Unions (it leaves behind some big phi nodes tagged with pi nodes that should have killed them), and not for any particular reason that this should have been discernibly slower.

@vtjnash vtjnash added the domain:strings "Strings!" label Apr 14, 2021
@vtjnash vtjnash requested a review from stevengj April 14, 2021 20:21
vtjnash added a commit to JuliaCI/BaseBenchmarks.jl that referenced this pull request Apr 14, 2021
The multi-replace method is being added in JuliaLang/julia#40484
vtjnash added a commit to JuliaCI/BaseBenchmarks.jl that referenced this pull request Apr 14, 2021
The multi-replace method is being added in JuliaLang/julia#40484
vtjnash added a commit to JuliaCI/BaseBenchmarks.jl that referenced this pull request Apr 14, 2021
The multi-replace method is being added in JuliaLang/julia#40484
@stevengj
Copy link
Member

stevengj commented Apr 15, 2021

Thanks, this is great!

I just tried another benchmark and confirmed that this is faster than a regex approach:

s = "elephant giraffe antelope lion cheetah, "^1000;
const reg = r"elephant|giraffe|antelope|lion|cheetah"
const rep = Dict("elephant"=>"grey", "giraffe"=>"yellow", "antelope"=>"brown", "lion"=>"gold", "cheetah"=>"yellow")

@btime replace($s, $(reg=>(s->rep[s])));
@btime replace($s, "elephant"=>"grey", "giraffe"=>"yellow", "antelope"=>"brown", "lion"=>"gold", "cheetah"=>"yellow"); 

gives 696µs (regex) and 440µs (PR).

However, for single-character replacements:

using Random
sr = randstring(1000);
const upper = Dict(c=>uppercase(c) for c in 'a':'z');

@btime uppercase($sr);
@btime map($(c -> 'a'  c  'z' ? c-32 : c), $sr);
@btime replace($sr, $(r"[a-z]"=>uppercase));
@btime replace($sr, $(Set('a':'z')=>uppercase));
@btime replace($sr, $(r"[a-z]"=>(c->upper[c[1]])));
@btime replace($sr, $(r"a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z"=>(c->upper[c[1]])));
@btime replace($sr, $(Set('a':'z')=>(c->upper[c])));

@btime replace($sr, 'a' => 'A', 'b' => 'B', 'c' => 'C', 'd' => 'D', 'e' => 'E', 'f' => 'F', 'g' => 'G', 'h' => 'H', 'i' => 'I', 'j' => 'J', 'k' => 'K', 'l' => 'L', 'm' => 'M', 'n' => 'N', 'o' => 'O', 'p' => 'P', 'q' => 'Q', 'r' => 'R', 's' => 'S', 't' => 'T', 'u' => 'U', 'v' => 'V', 'w' => 'W', 'x' => 'X', 'y' => 'Y', 'z' => 'Z');

gives 6.2µs, 5.7µs, 71µs, 39µs, 51µs, 60µs, 44µs, and 1.6ms, respectively.

This is a bit more concerning — in particular, for the multi-arg replace I'm getting an excessive number of allocations for the single-char benchmark:

  1.566 ms (20826 allocations: 987.97 KiB)

Is it because the argument tuple is too long?

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Apr 15, 2021

Yeah, at some point tuple operations switch automatically to using Arrays, as that's a policy decision made elsewhere. Though, as you showed, generally map may be more appropriate for such a situations anyways, as it is a similar penalty for map -> set (10x) as it is for set -> many arguments (30x).

@stevengj
Copy link
Member

stevengj commented Apr 15, 2021

Yeah, at some point tuple operations switch automatically to using Arrays, as that's a policy decision made elsewhere.

It would be nice if the implementation could switch to map! in that case so that it can update the array in-place rather than reallocating it each time…

I can easily see people splatting a large number of replacements here.

@stevengj
Copy link
Member

stevengj commented Apr 16, 2021

Of course, if you have a really large number of replacements then you would probably want to use a different data structure for rs, e.g. a heap, and in that case you would probably want to pass the replacement pairs as an iterable collection rather than splatting.

@@ -572,6 +590,13 @@ If `pat` is a regular expression and `r` is a [`SubstitutionString`](@ref), then
references in `r` are replaced with the corresponding matched text.
To remove instances of `pat` from `string`, set `r` to the empty `String` (`""`).

Multiple patterns can be specified, and they will be applied left-to-right
simultaneously, so only one pattern will be applied to any character, and the
patterns will only be applied to the input text, not the replacements.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section is misleading - the text seems to sugest only patterns of the form 'x' => "replacement" or 'x' => 'a' are allowed, but the tests further down seem to suggest "original" => "replacement" and "original" => 'a' should also work.

Additionally, what should the result of replace("atest", "atest" => 'a', "at" => 'b') be? Simply "a", since the patterns are applied left to right, even though "simultaneously" applying them leads to inconsistency (and imo both are reasonable choices)?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patterns are "not applied to the replacements". Each character is only matched by at most one pattern, though a pattern may match multiple characters (or even none).

@Seelengrab
Copy link
Contributor

Seelengrab commented Apr 21, 2021

I like the performance, but I'm not convinced the behaviour is best in all cases. I've said it before, but I don't think replace(::String, ::String => x, ::String => y, ...) should exist. Just looking through discourse and reading posts from people looking for this method, you get a bunch of conflicting desired behaviours. Because of that, I think almost all interpretations of how this API should behave are reasonable and making a definitive choice here feels weird to me (and would probably lead to confusion from people expecting it to behave a certain way...). How do other languages handle this and how have they solved those problems?

Consistency with replace(::AbstractArray, pat => rep, pat => rep) doesn't feel like a good argument, since in those cases replacement only ever concerns one index, whereas replace(::String, ::String => x, ::String => y) covers potentially multiple indices, leading to the problems described in the linked post.

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented May 27, 2021

triage thinks this is good

This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. #25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes #35327
fixes #39061
fixes #35414
fixes #29849
fixes #30457
fixes #25396
fixnext API is awkward
@stevengj
Copy link
Member

stevengj commented Jun 7, 2021

How do other languages handle this and how have they solved those problems?

Python doesn't seem to have a function for this. The top link discussing the issue seems to be here on stackoverflow, and the top suggestions are either to make multiple replace passes or to do a single pass with a regex. All note that that the problem is somewhat ambiguous for overlapping matches.

Ruby also lacks such a feature, and the main suggestion seems to be the regex solution (also here or here).

Similarly in Perl, people seem to suggest the regex solution (e.g. here or here). In PostgreSQL, I found a suggestion that people use Perl via an extension and employ the regex solution.

In Rust, I again see people proposing the regex solution. In Swift, I've also seen the regex solution proposed.

In short, questions of how to do multiple replacements in a "single pass" seem to pop up in many languages, but I've yet to find a language that provides a "built-in" solution. In every other language, the "efficient" way to do this is commonly believed to be a regex combined with a match-mapping function. Most people seem to be happy with the semantics of the regex solution, and are generally unconcerned with the question of overlapping matches.

@JeffBezanson
Copy link
Sponsor Member

Good references. So do you think we should hold off on merging this?

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Jun 7, 2021

This currently implements a generalization of the behavior of converting all of the arguments to a single regex, if a regex could tell you which part of the regex was matched. So it seems like positive encouragement for that being the generally accepted workaround for other languages. We already knew Julia was better than them :)

@JeffBezanson
Copy link
Sponsor Member

Yeah it seems good to me, was just wondering what Steven meant since I expect long posts to be complaints 😛

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Jun 7, 2021

And just for amusement, here's a random use that just came to mind:

julia> titlecaseish(s) = replace(s, r"(?<=\s|^)." => uppercase, r"." => lowercase);

julia> titlecaseish("ab c bbb  de   Fajk JEJ's.")
"Ab C Bbb  De   Fajk Jej's."

@JeffBezanson JeffBezanson merged commit 70771b2 into master Jun 7, 2021
@JeffBezanson JeffBezanson deleted the jn/multireplacer branch June 7, 2021 23:02
shirodkara pushed a commit to shirodkara/julia that referenced this pull request Jun 9, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
johanmon pushed a commit to johanmon/julia that referenced this pull request Jul 5, 2021
This has been attempted before, sometimes fairly similar to this, but
the attempts seemed to be either too simple or too complicated. This
aims to be simple, and even beats one of the "handwritten" benchmark
cases.

Past issues (e.g. JuliaLang#25396) have proposed that using Regex may be faster,
but in my tests, this handily bests even simplified regexes. There can
be slow Regexes patterns that can cause this to exhibit O(n^2) behavior,
but only if the one of the earlier patterns is a partial match for a
later pattern Regex and that Regex always matches O(n) of the input
stream. This is a case that is hopefully usually avoidable in practice.

fixes JuliaLang#35327
fixes JuliaLang#39061
fixes JuliaLang#35414
fixes JuliaLang#29849
fixes JuliaLang#30457
fixes JuliaLang#25396
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants