Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up search and find API #24673

Merged
merged 4 commits into from
Jan 4, 2018
Merged

Clean up search and find API #24673

merged 4 commits into from
Jan 4, 2018

Conversation

nalimilan
Copy link
Member

This is a first step towards deprecating search in favor of findfirst and findnext (#10593). The same changes need to be applied to rsearch, but I figured it would make sense to discuss the design before since issues are similar.

Remarks/points to discuss:

  • Instead of allowing findnext(::Char, ::AbstractString, idx) to replace the equivalent search method, the PR requires doing findnext(equalto(c), haystack, idx). This can easily be changed later, but better be strict in 1.0. In the same vein, to find the first char from a set of possibilities, one needs to do findnext(c in chars, haystack, idx); passing chars directly is deprecated. findnext and findfirst are therefore more restrictive than split and replace, which accept a single char or an array of chars, but I think this difference is justified because find* functions are much more general/complex.
  • On the other hand, findnext(::Union{Regex,AbstractString}, ::AbstractString, idx) is supported even though it has no equivalent for arrays, because it's unambiguous. We could require something like findnext(seq(needle), haystack), which would also work for arrays, but that sounds too strict to me. Note that this method is an exception anyway since it returns a range of indices rather than a single index.

@nalimilan nalimilan added the domain:search & find The find* family of functions label Nov 20, 2017
@nalimilan nalimilan force-pushed the nl/search branch 2 times, most recently from 7e109ce to e4cc7b6 Compare November 21, 2017 18:09
# FIXME: no replacement to search for a multibyte char in a ByteArray
@deprecate search(a::Union{String,ByteArray}, b::Union{Int8,UInt8}, i::Integer = 1) findnext(equalto(b), a, i)
@deprecate search(a::String, b::Union{Int8,UInt8}, i::Integer = 1) findnext(equalto(Char(b)), a, i)
@deprecate search(a::ByteArray, b::Char, i::Integer = 1) findnext(equalto(Unt8(b)), a, i)
Copy link
Sponsor Member

@JeffBezanson JeffBezanson Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo Unt8

@deprecate search(s::String, c::Char) findfirst(equalto(c), s)
# FIXME: no replacement to search for a multibyte char in a ByteArray
@deprecate search(a::Union{String,ByteArray}, b::Union{Int8,UInt8}, i::Integer = 1) findnext(equalto(b), a, i)
@deprecate search(a::String, b::Union{Int8,UInt8}, i::Integer = 1) findnext(equalto(Char(b)), a, i)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is breaking since the old behavior was to search for bytes. However we might want to deprecate this with no replacement since it doesn't really make sense to search for a UInt8 in a String.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, these methods were really a mess. Some were probably supposed to be internal, but were de facto exported, and were too flexible regarding the accepted types. It's easy and efficient to do Vector{UInt8}(s) if you want to look for a byte.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though, maybe better keep a semi-working deprecation than no deprecation at all? It's quite frequent to look for ASCII characters, for which the old method looking for bytes worked.

@deprecate search(a::ByteArray, b::Char, i::Integer = 1) findnext(equalto(Unt8(b)), a, i)

@deprecate search(s::AbstractString, c::Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}, i::Integer) findnext(x -> x in c, s, i)
@deprecate search(s::AbstractString, c::Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}) findfirst(x -> x in c, s, i)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding occursin(c) for this function?

@JeffBezanson
Copy link
Sponsor Member

👍 Good, I like where this is going!

I believe findfirst and findnext currently only return single indices. If so, we should probably keep it that way and only use them to deprecate searchindex for substrings.

@nalimilan
Copy link
Member Author

I believe findfirst and findnext currently only return single indices. If so, we should probably keep it that way and only use them to deprecate searchindex for substrings.

Right. Though the exception can be justified, since string and regex matches can span over several entries. We could extend this idea to any sequence of elements, e.g. to look for a vector inside another vector.

Do we really need searchindex? AFAICT to determine whether a substring matches, you need to go over it in full anyway, so you know where it ends and there should be no overhead to compute the end of the range.

@ararslan ararslan added domain:collections Data structures holding multiple items, e.g. sets kind:deprecation This change introduces or involves a deprecation labels Nov 21, 2017
@StefanKarpinski
Copy link
Sponsor Member

Has using nothing as "not found" sentinel return value been discussed? That seems in line with what where we're going in the data ecosystem.

@nalimilan
Copy link
Member Author

nalimilan commented Nov 22, 2017

Has using nothing as "not found" sentinel return value been discussed? That seems in line with what where we're going in the data ecosystem.

Yes, I think we need to discuss that, but that will have to be a separate PR since it doesn't affect only strings.

EDIT: see JuliaLang/Juleps#47

@nalimilan
Copy link
Member Author

I've added commits deprecating rsearch and adding occursin. So I think it's mostly ready now. We just need to decide whether we should change the behavior of findnext(::String, ::String) and similar functions (see my comment). This will also determine what to do with searchindex and rsearchindex.

@nalimilan
Copy link
Member Author

Bump: should findfirst/findlast return a range (current state of the PR) or an index to the beginning of the match? See my comments above. I'd like to merge this PR so that I can progress on other search & find issues.

@timholy
Copy link
Sponsor Member

timholy commented Dec 3, 2017

I haven't had time to follow this, but my 2c is that if using a range in strings opens up the possibility of using a range for AbstractArray, that might be a good way to solve the "not found" issue (return an empty range rather than 0 as a sentinel) in a way that plays nicely with arbitrary indices.

@nalimilan
Copy link
Member Author

@timholy That's actually a different issue. findfirst returns a range in the current state of the PR when looking for a string/regex within a string, because the matching substring is a sequences of entries (characters). I'm not proposing that looking e.g. for an Int within a Vector{Int} or for a Char in a String returns a range, because the end of the range adds no information. For cases where findfirst currently returns 0, I think the only natural/practical approach that generalizes is to return nothing (JuliaLang/Juleps#47).

BTW, an argument for findfirst(::String, ::String) to return a range (current state of the PR) is that if we return only the first index of the match, people will end up computing the end of the match by hand via things like start + length(haystack), which will work in ASCII but not in general. So better return the full information by default from the most obvious function people will try.

@timholy
Copy link
Sponsor Member

timholy commented Dec 3, 2017

For cases where findfirst currently returns 0, I think the only natural/practical approach that generalizes is to return nothing (JuliaLang/Juleps#47).

Very reasonable, esp. since union-splitting keeps the performance hit low. My point was simply that returning n:n for a hit and 1:0 for a miss also works for the array case, and if a range is better for strings then perhaps this is an argument for not making you change more stuff here 🙂. I haven't benchmarked union-splitting vs a type-stable branch-on-value, if they are the same then performance isn't a reason to decide one way or another (but might be if they are not the same).

@kmsquire
Copy link
Member

As a point of reference, searchsorted currently returns a range, and when the element being searched for isn’t present, it returns and empty range indicating the insertion point.

A long time ago (probably in Julia 0.3), I did some ad hock benchmarking, and there was some overhead in creating the range compared with returning a single integer. However, the compiler has changed dramatically since then.

It’s also unclear how useful it is to always get the range of indices containing a particular value, vs, e.g., just getting the first or last insertion point.

@nalimilan
Copy link
Member Author

nalimilan commented Dec 11, 2017

See #24883 about searchsorted* functions.

@c42f
Copy link
Member

c42f commented Jan 1, 2018

Yes, don't hold up on my accord. I'm not thrilled by contains, but it's quite possible I was abusing the intent of ismatch previously (even though the name itself seemed good).

I do wonder whether it would be good to have a general verb for pattern matching. contains seems very specific to the string case.

@StefanKarpinski
Copy link
Sponsor Member

I would rather merge this in its current form as a massive improvement in consistency and then debate some of the finer points after we've merged. We're not going to get this 100% right and that's ok, we'll learn some lessons during 1.x and do even better in 2.0.

@stevengj
Copy link
Member

stevengj commented Feb 3, 2018

It would be good to have Compat support for this, e.g. to replace rsearch calls...

@nalimilan
Copy link
Member Author

Yeah, that's quite a mess to support given all the changes. I've filed JuliaLang/Compat.jl#484 to implement most of the changes made here, but some from other PRs will have to be addressed separately.

@andyferris
Copy link
Member

I'm late to the party, but I've been thinking about equalto and occursin.

  • These are constructors for EqualTo a OccursIn. I thought we had stopped making special lower-case constructors in favor of actual type constructors (so long as the constructor makes sense, which I believe it does in this case).
  • If the interface will be uppercase types, why not have IsEqual and In? Having the same name as the corresponding function would help discoverability and readability.

PS I do love the direction this PR has taken us in. I'm thinking using and reading stuff like getting all the indices of the NaNs in a vector via findall(IsEqual(NaN), vector_of_floats) and getting all the points within a given sphere via filter(In(sphere), vector_of_points) is reasonably nice.

As it stands I don't really love occursin and I naively assumed equalto would use == not isequal.

@nalimilan
Copy link
Member Author

These are constructors for EqualTo a OccursIn. I thought we had stopped making special lower-case constructors in favor of actual type constructors (so long as the constructor makes sense, which I believe it does in this case).

Do you have a reference? We still use lowercase constructors systematically in Iterators. I think the idea is that the type is essentially an implementation detail.

I kind of agree the names aren't great, for example we use the third person for occursin but we don't say equals. It was proposed to just use in(x), which would also work for isequal(x), but I was concerned about the consistency with other functions: where should we stop adding currying methods? We could also introduce a subset of #24990 with special cases for these two functions (which is needed for efficient dispatch), making _ in x create a OccursIn(x) object and isequal(_, x)/_ == x create an EqualTo(x) object.

@stevengj
Copy link
Member

stevengj commented Feb 13, 2018

Rather than having special lowering for _ in x, maybe f(_,x) in #24990 could be modified to lower to some kind of UnderscoreFunction(f, ...) closure object with a default method that just calls f, but which you could overload to provide specific behaviors for particular functions f. A tricky thing would be to figure out the right encoding of the arguments etcetera in the type so that there is no performance penalty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:collections Data structures holding multiple items, e.g. sets domain:search & find The find* family of functions kind:deprecation This change introduces or involves a deprecation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants