Skip to content

Commit

Permalink
remove RevString; efficient generic reverseind
Browse files Browse the repository at this point in the history
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
  • Loading branch information
StefanKarpinski committed Nov 28, 2017
1 parent c2feee7 commit ecc0bf0
Show file tree
Hide file tree
Showing 14 changed files with 96 additions and 146 deletions.
11 changes: 11 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -295,6 +295,12 @@ This section lists changes that do not have deprecation warnings.
`AbstractArray` types that specialized broadcasting using the old internal API will
need to switch to the new API. ([#20740])

* The `RevString` type has been removed from the language; `reverse(::String)` returns
a `String` with code points (or fragments thereof) in reverse order. In general,
`reverse(s)` should return a string of the same type and encoding as `s` with code
points in reverse order; any string type overrides `reverse` to return a different
type of string must also override `reverseind` to compute reversed indices correctly.

Library improvements
--------------------

Expand Down Expand Up @@ -397,6 +403,11 @@ Library improvements
The generic definition is constant time but calls `endof(s)` which may be inefficient.
Therefore custom string types may want to define direct `ncodeunits` methods.

* `reverseind(s::AbstractString, i::Integer)` now has an efficient generic fallback, so
custom string types do not need to provide their own efficient defintions. The generic
definition relies on `ncodeunits` however, so for optimal performance you may need to
define a custom method for that function.

Compiler/Runtime improvements
-----------------------------

Expand Down
1 change: 0 additions & 1 deletion base/exports.jl
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,6 @@ export
Rational,
Regex,
RegexMatch,
RevString,
RoundFromZero,
RoundDown,
RoundingMode,
Expand Down
3 changes: 0 additions & 3 deletions base/precompile.jl
Original file line number Diff line number Diff line change
Expand Up @@ -578,9 +578,6 @@ precompile(Tuple{typeof(Base.LineEdit.complete_line), Base.LineEdit.PromptState,
precompile(Tuple{typeof(Base.LineEdit.input_string_newlines_aftercursor), Base.LineEdit.PromptState})
precompile(Tuple{typeof(Base.LineEdit.complete_line), Base.REPL.REPLCompletionProvider, Base.LineEdit.PromptState})
precompile(Tuple{getfield(Base, Symbol("#kw##parse")), Array{Any, 1}, typeof(Base.parse), String})
precompile(Tuple{typeof(Base.isvalid), Base.RevString{String}, Int64})
precompile(Tuple{typeof(Base.nextind), Base.RevString{String}, Int64})
precompile(Tuple{typeof(Base.search), Base.RevString{String}, Array{Char, 1}, Int64})
precompile(Tuple{typeof(Base.rsearch), String, Array{Char, 1}, Int64})
precompile(Tuple{getfield(Base.REPLCompletions, Symbol("#kw##find_start_brace")), Array{Any, 1}, typeof(Base.REPLCompletions.find_start_brace), String})
precompile(Tuple{typeof(Core.Inference.isbits), Tuple{Void, Void, Void}})
Expand Down
4 changes: 2 additions & 2 deletions base/repl/REPLCompletions.jl
Original file line number Diff line number Diff line change
Expand Up @@ -225,7 +225,7 @@ end
# closed start brace from the end of the string.
function find_start_brace(s::AbstractString; c_start='(', c_end=')')
braces = 0
r = RevString(s)
r = reverse(s)
i = start(r)
in_single_quotes = false
in_double_quotes = false
Expand Down Expand Up @@ -259,7 +259,7 @@ function find_start_brace(s::AbstractString; c_start='(', c_end=')')
braces == 1 && break
end
braces != 1 && return 0:-1, -1
method_name_end = reverseind(r, i)
method_name_end = reverseind(s, i)
startind = nextind(s, rsearch(s, non_identifier_chars, method_name_end))
return (startind:endof(s), method_name_end)
end
Expand Down
2 changes: 1 addition & 1 deletion base/shell.jl
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ function shell_parse(str::AbstractString, interpolate::Bool=true;
special::AbstractString="")
s = lstrip(str)
# strips the end but respects the space when the string ends with "\\ "
r = RevString(s)
r = reverse(s)
i = start(r)
c_old = nothing
while !done(r,i)
Expand Down
52 changes: 26 additions & 26 deletions base/strings/search.jl
Original file line number Diff line number Diff line change
Expand Up @@ -194,12 +194,6 @@ end
search(s::AbstractString, t::AbstractString, i::Integer=start(s)) = _search(s, t, i)
search(s::ByteArray, t::ByteArray, i::Integer=start(s)) = _search(s, t, i)

function rsearch(s::AbstractString, c::Chars)
j = search(RevString(s), c)
j == 0 && return 0
endof(s)-j+1
end

"""
rsearch(s::AbstractString, chars::Chars, [start::Integer])
Expand All @@ -212,44 +206,50 @@ julia> rsearch("aaabbb","b")
6:6
```
"""
function rsearch(s::AbstractString, c::Chars, i::Integer)
e = endof(s)
j = search(RevString(s), c, e-i+1)
j == 0 && return 0
e-j+1
function rsearch(s::AbstractString, c::Chars, i::Integer=start(s))
if i < 1
return i == 0 ? 0 : throw(BoundsError(s, i))
end
n = ncodeunits(s)
if i > n
return i == n+1 ? 0 : throw(BoundsError(s, i))
end
# r[reverseind(r,i)] == reverse(r)[i] == s[i]
# s[reverseind(s,j)] == reverse(s)[j] == r[j]
r = reverse(s)
j = search(r, c, reverseind(r, i))
j == 0 ? 0 : reverseind(s, j)
end

function _rsearchindex(s, t, i)
if isempty(t)
return 1 <= i <= nextind(s,endof(s)) ? i :
return 1 <= i <= nextind(s, endof(s)) ? i :
throw(BoundsError(s, i))
end
t = RevString(t)
rs = RevString(s)
t = reverse(t)
rs = reverse(s)
l = endof(s)
t1, j2 = next(t,start(t))
t1, j2 = next(t, start(t))
while true
i = rsearch(s,t1,i)
if i == 0 return 0 end
c, ii = next(rs,l-i+1)
i = rsearch(s, t1, i)
i == 0 && return 0
c, ii = next(rs, reverseind(rs, i))
j = j2; k = ii
matched = true
while !done(t,j)
if done(rs,k)
while !done(t, j)
if done(rs, k)
matched = false
break
end
c, k = next(rs,k)
d, j = next(t,j)
c, k = next(rs, k)
d, j = next(t, j)
if c != d
matched = false
break
end
end
if matched
return nextind(s,l-k+1)
end
i = l-ii+1
matched && return nextind(s, reverseind(s, k))
i = reverseind(s, ii)
end
end

Expand Down
40 changes: 0 additions & 40 deletions base/strings/string.jl
Original file line number Diff line number Diff line change
Expand Up @@ -295,14 +295,6 @@ function first_utf8_byte(ch::Char)
return b
end

function reverseind(s::String, i::Integer)
j = sizeof(s) + 1 - i
@inbounds while is_valid_continuation(codeunit(s, j))
j -= 1
end
return j
end

## overload methods for efficiency ##

isvalid(s::String, i::Integer) =
Expand Down Expand Up @@ -477,38 +469,6 @@ function string(a::Union{String,Char}...)
return out
end

function reverse(s::String)
dat = Vector{UInt8}(s)
n = length(dat)
n <= 1 && return s
buf = StringVector(n)
out = n
pos = 1
@inbounds while out > 0
ch = dat[pos]
if ch > 0xdf
if ch < 0xf0
(out -= 3) < 0 && throw(UnicodeError(UTF_ERR_SHORT, pos, ch))
buf[out + 1], buf[out + 2], buf[out + 3] = ch, dat[pos + 1], dat[pos + 2]
pos += 3
else
(out -= 4) < 0 && throw(UnicodeError(UTF_ERR_SHORT, pos, ch))
buf[out+1], buf[out+2], buf[out+3], buf[out+4] = ch, dat[pos+1], dat[pos+2], dat[pos+3]
pos += 4
end
elseif ch > 0x7f
(out -= 2) < 0 && throw(UnicodeError(UTF_ERR_SHORT, pos, ch))
buf[out + 1], buf[out + 2] = ch, dat[pos + 1]
pos += 2
else
buf[out] = ch
out -= 1
pos += 1
end
end
String(buf)
end

function repeat(s::String, r::Integer)
r < 0 && throw(ArgumentError("can't repeat a string $r times"))
n = sizeof(s)
Expand Down
2 changes: 1 addition & 1 deletion base/strings/strings.jl
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

include("strings/errors.jl")
include("strings/types.jl")
include("strings/substring.jl")
include("strings/basic.jl")
include("strings/search.jl")
include("strings/util.jl")
Expand Down
55 changes: 21 additions & 34 deletions base/strings/types.jl → base/strings/substring.jl
Original file line number Diff line number Diff line change
@@ -1,9 +1,5 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

# SubString and RevString types

## substrings reference original strings ##

"""
SubString(s::AbstractString, i::Integer, j::Integer=endof(s))
SubString(s::AbstractString, r::UnitRange{<:Integer})
Expand Down Expand Up @@ -51,6 +47,9 @@ end
SubString(s::AbstractString) = SubString(s, 1, endof(s))
SubString{T}(s::T) where {T<:AbstractString} = SubString{T}(s, 1, endof(s))

convert(::Type{SubString{S}}, s::AbstractString) where {S<:AbstractString} =
SubString(convert(S, s))

String(p::SubString{String}) =
unsafe_string(pointer(p.string, p.offset+1), nextind(p, p.endof)-1)

Expand Down Expand Up @@ -123,32 +122,18 @@ function unsafe_convert(::Type{Ptr{R}}, s::SubString{String}) where R<:Union{Int
convert(Ptr{R}, pointer(s.string)) + s.offset
end

## reversed strings without data movement ##

struct RevString{T<:AbstractString} <: AbstractString
string::T
end

endof(s::RevString) = endof(s.string)
length(s::RevString) = length(s.string)
sizeof(s::RevString) = sizeof(s.string)

function next(s::RevString, i::Int)
n = endof(s); j = n-i+1
(s.string[j], n-prevind(s.string,j)+1)
end

"""
reverse(s::AbstractString) -> AbstractString
Reverses a string.
Technically, this function reverses the codepoints in a string, and its
Reverses a string. Technically, this function reverses the codepoints in a string and its
main utility is for reversed-order string processing, especially for reversed
regular-expression searches. See also [`reverseind`](@ref) to convert indices
in `s` to indices in `reverse(s)` and vice-versa, and [`graphemes`](@ref)
to operate on user-visible "characters" (graphemes) rather than codepoints.
See also [`Iterators.reverse`](@ref) for reverse-order iteration without making a copy.
regular-expression searches. See also [`reverseind`](@ref) to convert indices in `s` to
indices in `reverse(s)` and vice-versa, and [`graphemes`](@ref) to operate on user-visible
"characters" (graphemes) rather than codepoints. See also [`Iterators.reverse`](@ref) for
reverse-order iteration without making a copy. Custom string types must implement the
`reverse` function themselves and should typically return a string with the same type
and encoding. If they return a string with a different encoding, they must also override
`reverseind` for that string type to satisfy `s[reverseind(s,i)] == reverse(s)[i]`.
# Examples
```jldoctest
Expand All @@ -162,10 +147,15 @@ julia> join(reverse(collect(graphemes("ax̂e")))) # reverses graphemes
"ex̂a"
```
"""
reverse(s::AbstractString) = RevString(s)
reverse(s::RevString) = s.string

## reverse an index i so that reverse(s)[i] == s[reverseind(s,i)]
function reverse(s::Union{String,SubString{String}})::String
sprint() do io
i, j = start(s), endof(s)
while i j
c, j = s[j], prevind(s, j)
write(io, c)
end
end
end

"""
reverseind(v, i)
Expand All @@ -185,10 +175,7 @@ julia> for i in 1:length(r)
Julia
```
"""
reverseind(s::AbstractString, i) = chr2ind(s, length(s) + 1 - ind2chr(reverse(s), i))
reverseind(s::RevString, i::Integer) = endof(s) - i + 1
reverseind(s::SubString{String}, i::Integer) =
reverseind(s.string, nextind(s.string, endof(s.string))-s.offset-s.endof+i-1) - s.offset
reverseind(s::AbstractString, i::Integer) = thisind(s, ncodeunits(s)-i+1)

"""
repeat(s::AbstractString, r::Integer)
Expand Down
15 changes: 7 additions & 8 deletions base/strings/util.jl
Original file line number Diff line number Diff line change
Expand Up @@ -190,16 +190,15 @@ julia> rstrip(a)
```
"""
function rstrip(s::AbstractString, chars::Chars=_default_delims)
r = RevString(s)
i = start(r)
while !done(r,i)
c, j = next(r,i)
if !(c in chars)
return SubString(s, 1, endof(s)-i+1)
end
a = start(s)
i = endof(s)
while a  i
c = s[i]
j = prevind(s, i)
c in chars || return SubString(s, 1:i)
i = j
end
SubString(s, 1, 0)
SubString(s, a, a-1)
end

"""
Expand Down
3 changes: 3 additions & 0 deletions stdlib/Test/src/Test.jl
Original file line number Diff line number Diff line change
Expand Up @@ -1384,6 +1384,9 @@ struct GenericString <: AbstractString
end
Base.endof(s::GenericString) = endof(s.string)
Base.next(s::GenericString, i::Int) = next(s.string, i)
Base.reverse(s::GenericString) = GenericString(reverse(s.string))
Base.reverse(s::SubString{GenericString}) =
GenericString(typeof(s.string)(reverse(String(s))))

"""
The `GenericSet` can be used to test generic set APIs that program to
Expand Down
2 changes: 1 addition & 1 deletion test/replcompletions.jl
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
using Base.REPLCompletions

let ex = quote
module CompletionFoo
module CompletionFoo
mutable struct Test_y
yy
end
Expand Down
Loading

0 comments on commit ecc0bf0

Please sign in to comment.