Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: Make "findall" work on a string #31788

Closed
AhmedSalih3d opened this issue Apr 21, 2019 · 9 comments · Fixed by #31834
Closed

Suggestion: Make "findall" work on a string #31788

AhmedSalih3d opened this issue Apr 21, 2019 · 9 comments · Fixed by #31834
Labels
domain:strings "Strings!"

Comments

@AhmedSalih3d
Copy link

Hello all

Currently it is possible in Julia:

x = [1 1 3 4 1]
findall(x->x==1,vec(x))
3-element Array{Int64,1}:
1
2
5

If I use a string array:, it will still work:

x = ["1" "1" "3" "4" "1"]
findall(x->x=="1",vec(x))
3-element Array{Int64,1}:
1
2
5

But if I instead have one complete string ie;

x = "11341"
findall(x->x=="1",vec(x)) (Doesn't work will give an error)
There doesn't seem to be any easy function/way to go about this, so I am "missing" a findall feature, instead of having to split a string into substrings/chars.

Inspired by these posts here (made by me earlier on Discourse):

https://discourse.julialang.org/t/find-index-of-all-occurences-of-a-string/23044
https://discourse.julialang.org/t/suggestion-findall-to-work-on-strings/23143

Kind regards

@stevengj
Copy link
Member

stevengj commented Apr 21, 2019

Since we have findfirst(substring, string) and findnext(substring, string, i) returning UnitRange{Int}, it would be natural to have findall(substring, string). A natural definition seems like it could be something like:

function findall(t::Union{AbstractString,Regex}, s::AbstractString)
    found = UnitRange{Int}[]
    i = firstindex(s)
    while true
        r = findnext(t, s, i)
        isnothing(r) && return found
        push!(found, r)
        i = nextind(s, last(r))
    end
end

which gives output like

julia> findall("ing", "Spinning laughing dancing")
3-element Array{UnitRange{Int64},1}:
 6:8  
 15:17
 23:25

julia> findall(r"\w+", "Spinning laughing dancing")
3-element Array{UnitRange{Int64},1}:
 1:8  
 10:17
 19:25

@stevengj stevengj added the domain:strings "Strings!" label Apr 21, 2019
@stevengj
Copy link
Member

stevengj commented Apr 21, 2019

Note that my implementation above does not terminate for findall("", "foo") and similar (cases where isempty(r) is true)— it's not entirely clear to me what the desired result is in such cases?

@stevengj
Copy link
Member

stevengj commented Apr 21, 2019

To handle the empty-range case, maybe :

function findall(t::Union{AbstractString,Regex}, s::AbstractString)
    found = UnitRange{Int}[]
    i = firstindex(s)
    while true
        r = findnext(t, s, i)
        isnothing(r) && return found
        push!(found, r)
        j = isempty(r) ? first(r) : last(r)
        j > lastindex(s) && return found
        @inbounds i = nextind(s, j)
    end
end

which gives e.g.

julia> findall("", "foo")
4-element Array{UnitRange{Int64},1}:
 1:0
 2:1
 3:2
 4:3

julia> findall(r"\w*", "Spinning laughing dancing")
6-element Array{UnitRange{Int64},1}:
 1:8  
 9:8  
 10:17
 18:17
 19:25
 26:25

@stevengj
Copy link
Member

It's also not clear to me what the desired behavior is for overlapping substrings? Do we want to find only disjoint substrings (the code above), or do we want all substrings even if they are overlapping?

For the latter behavior, just change my code above to j = first(r). In this case, you get e.g.:

julia> findall(r"\w*", "Spinning laughing dancing")
26-element Array{UnitRange{Int64},1}:
 1:8  
 2:8  
 3:8  
 4:8  
 5:8  
 6:8  
 7:8  
 8:8  
 9:8  
 10:17
 11:17
 12:17
 13:17
 14:17
 15:17
 16:17
 17:17
 18:17
 19:25
 20:25
 21:25
 22:25
 23:25
 24:25
 25:25
 26:25

I can imagine both possibilities being useful. Maybe a disjoint::Bool=true keyword?

@fredrikekre
Copy link
Member

fredrikekre commented Apr 21, 2019

eachmatch(::Regex, ::AbstractString) has a overlap::Bool=false kwarg, so seems natural to reuse that here too.

@AhmedSalih3d
Copy link
Author

Personally, I like @fredrikekre suggestion with reusing the overlap::Bool=false, to make it more useable, but personally for your findall("", "foo") @stevengj , I would prefer empty result (not an error, just "nothing"), since it would make more sense for an user than having all indices of each string as in your example here:

julia> findall("", "foo") 4-element Array{UnitRange{Int64},1}: 1:0 2:1 3:2 4:3

So maybe instead just an empty UnitRange array?

Thirdly, I just wondered why would you sometimes get ranges from 1:8 and then 9:8, is it also reading the string backwards? Maybe this should be a optional kwarg as well.

Just suggestions from a users point of view.

Kind regards

@stevengj
Copy link
Member

stevengj commented Apr 23, 2019

@AhmedSalih3d, a range like 9:8 is a standard way in Julia to specify an empty range "starting" at 9, i.e. the range "between " indices 8 and 9. Empty ranges like 9:8 are a valid answer above because r"\w*" matches zero-or-more "word" characters. Another example is:

julia> findnext(r"\b", "foo bar", 2)
4:3

because the first word boundary after position 2 is between indices 4 and 3 in this string.

Although the result my code returns for findall("", "foo") may seem weird, I think it is consistent with this, whereas returning an empty array would not be — if you ask a weird question, you get a weird answer.

@stevengj
Copy link
Member

stevengj commented Apr 23, 2019

So, in summary, it seems that we want:

function findall(t::Union{AbstractString,Regex}, s::AbstractString; overlap::Bool=false)
    found = UnitRange{Int}[]
    i, e = firstindex(s), lastindex(s)
    while true
        r = findnext(t, s, i)
        isnothing(r) && return found
        push!(found, r)
        j = overlap || isempty(r) ? first(r) : last(r)
        j > e && return found
        @inbounds i = nextind(s, j)
    end
end

or equivalent?

Anyone who wants to put together a PR with some tests and documentation is welcome to grab this code.

@StefanKarpinski
Copy link
Sponsor Member

Yes, that looks correct to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!"
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants