Clean up search and find API #24673

nalimilan · 2017-11-20T21:17:30Z

This is a first step towards deprecating search in favor of findfirst and findnext (#10593). The same changes need to be applied to rsearch, but I figured it would make sense to discuss the design before since issues are similar.

Remarks/points to discuss:

Instead of allowing findnext(::Char, ::AbstractString, idx) to replace the equivalent search method, the PR requires doing findnext(equalto(c), haystack, idx). This can easily be changed later, but better be strict in 1.0. In the same vein, to find the first char from a set of possibilities, one needs to do findnext(c in chars, haystack, idx); passing chars directly is deprecated. findnext and findfirst are therefore more restrictive than split and replace, which accept a single char or an array of chars, but I think this difference is justified because find* functions are much more general/complex.
On the other hand, findnext(::Union{Regex,AbstractString}, ::AbstractString, idx) is supported even though it has no equivalent for arrays, because it's unambiguous. We could require something like findnext(seq(needle), haystack), which would also work for arrays, but that sounds too strict to me. Note that this method is an exception anyway since it returns a range of indices rather than a single index.

JeffBezanson · 2017-11-21T18:57:21Z

base/deprecated.jl

+# FIXME: no replacement to search for a multibyte char in a ByteArray
+@deprecate search(a::Union{String,ByteArray}, b::Union{Int8,UInt8}, i::Integer = 1) findnext(equalto(b), a, i)
+@deprecate search(a::String, b::Union{Int8,UInt8}, i::Integer = 1) findnext(equalto(Char(b)), a, i)
+@deprecate search(a::ByteArray, b::Char, i::Integer = 1) findnext(equalto(Unt8(b)), a, i)


typo Unt8

JeffBezanson · 2017-11-21T18:58:25Z

base/deprecated.jl

+@deprecate search(s::String, c::Char) findfirst(equalto(c), s)
+# FIXME: no replacement to search for a multibyte char in a ByteArray
+@deprecate search(a::Union{String,ByteArray}, b::Union{Int8,UInt8}, i::Integer = 1) findnext(equalto(b), a, i)
+@deprecate search(a::String, b::Union{Int8,UInt8}, i::Integer = 1) findnext(equalto(Char(b)), a, i)


I think this is breaking since the old behavior was to search for bytes. However we might want to deprecate this with no replacement since it doesn't really make sense to search for a UInt8 in a String.

Yes, these methods were really a mess. Some were probably supposed to be internal, but were de facto exported, and were too flexible regarding the accepted types. It's easy and efficient to do Vector{UInt8}(s) if you want to look for a byte.

Though, maybe better keep a semi-working deprecation than no deprecation at all? It's quite frequent to look for ASCII characters, for which the old method looking for bytes worked.

JeffBezanson · 2017-11-21T19:02:02Z

base/deprecated.jl

+@deprecate search(a::ByteArray, b::Char, i::Integer = 1) findnext(equalto(Unt8(b)), a, i)
+
+@deprecate search(s::AbstractString, c::Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}, i::Integer) findnext(x -> x in c, s, i)
+@deprecate search(s::AbstractString, c::Union{Tuple{Vararg{Char}},AbstractVector{Char},Set{Char}}) findfirst(x -> x in c, s, i)


How about adding occursin(c) for this function?

JeffBezanson · 2017-11-21T19:05:03Z

👍 Good, I like where this is going!

I believe findfirst and findnext currently only return single indices. If so, we should probably keep it that way and only use them to deprecate searchindex for substrings.

nalimilan · 2017-11-21T20:10:45Z

I believe findfirst and findnext currently only return single indices. If so, we should probably keep it that way and only use them to deprecate searchindex for substrings.

Right. Though the exception can be justified, since string and regex matches can span over several entries. We could extend this idea to any sequence of elements, e.g. to look for a vector inside another vector.

Do we really need searchindex? AFAICT to determine whether a substring matches, you need to go over it in full anyway, so you know where it ends and there should be no overhead to compute the end of the range.

StefanKarpinski · 2017-11-22T14:36:54Z

Has using nothing as "not found" sentinel return value been discussed? That seems in line with what where we're going in the data ecosystem.

nalimilan · 2017-11-22T14:51:16Z

Has using nothing as "not found" sentinel return value been discussed? That seems in line with what where we're going in the data ecosystem.

Yes, I think we need to discuss that, but that will have to be a separate PR since it doesn't affect only strings.

EDIT: see JuliaLang/Juleps#47

nalimilan · 2017-11-22T22:35:39Z

I've added commits deprecating rsearch and adding occursin. So I think it's mostly ready now. We just need to decide whether we should change the behavior of findnext(::String, ::String) and similar functions (see my comment). This will also determine what to do with searchindex and rsearchindex.

nalimilan · 2017-12-02T16:42:46Z

Bump: should findfirst/findlast return a range (current state of the PR) or an index to the beginning of the match? See my comments above. I'd like to merge this PR so that I can progress on other search & find issues.

timholy · 2017-12-03T10:11:35Z

I haven't had time to follow this, but my 2c is that if using a range in strings opens up the possibility of using a range for AbstractArray, that might be a good way to solve the "not found" issue (return an empty range rather than 0 as a sentinel) in a way that plays nicely with arbitrary indices.

nalimilan · 2017-12-03T11:10:29Z

@timholy That's actually a different issue. findfirst returns a range in the current state of the PR when looking for a string/regex within a string, because the matching substring is a sequences of entries (characters). I'm not proposing that looking e.g. for an Int within a Vector{Int} or for a Char in a String returns a range, because the end of the range adds no information. For cases where findfirst currently returns 0, I think the only natural/practical approach that generalizes is to return nothing (JuliaLang/Juleps#47).

BTW, an argument for findfirst(::String, ::String) to return a range (current state of the PR) is that if we return only the first index of the match, people will end up computing the end of the match by hand via things like start + length(haystack), which will work in ASCII but not in general. So better return the full information by default from the most obvious function people will try.

timholy · 2017-12-03T11:54:06Z

For cases where findfirst currently returns 0, I think the only natural/practical approach that generalizes is to return nothing (JuliaLang/Juleps#47).

Very reasonable, esp. since union-splitting keeps the performance hit low. My point was simply that returning n:n for a hit and 1:0 for a miss also works for the array case, and if a range is better for strings then perhaps this is an argument for not making you change more stuff here 🙂. I haven't benchmarked union-splitting vs a type-stable branch-on-value, if they are the same then performance isn't a reason to decide one way or another (but might be if they are not the same).

kmsquire · 2017-12-11T13:52:51Z

As a point of reference, searchsorted currently returns a range, and when the element being searched for isn’t present, it returns and empty range indicating the insertion point.

A long time ago (probably in Julia 0.3), I did some ad hock benchmarking, and there was some overhead in creating the range compared with returning a single integer. However, the compiler has changed dramatically since then.

It’s also unclear how useful it is to always get the range of indices containing a particular value, vs, e.g., just getting the first or last insertion point.

nalimilan · 2017-12-11T15:30:38Z

See #24883 about searchsorted* functions.

c42f · 2018-01-01T05:27:43Z

Yes, don't hold up on my accord. I'm not thrilled by contains, but it's quite possible I was abusing the intent of ismatch previously (even though the name itself seemed good).

I do wonder whether it would be good to have a general verb for pattern matching. contains seems very specific to the string case.

StefanKarpinski · 2018-01-04T17:22:54Z

I would rather merge this in its current form as a massive improvement in consistency and then debate some of the finer points after we've merged. We're not going to get this 100% right and that's ok, we'll learn some lessons during 1.x and do even better in 2.0.

stevengj · 2018-02-03T04:03:30Z

It would be good to have Compat support for this, e.g. to replace rsearch calls...

nalimilan · 2018-02-03T15:05:39Z

Yeah, that's quite a mess to support given all the changes. I've filed JuliaLang/Compat.jl#484 to implement most of the changes made here, but some from other PRs will have to be addressed separately.

andyferris · 2018-02-04T03:29:33Z

I'm late to the party, but I've been thinking about equalto and occursin.

These are constructors for EqualTo a OccursIn. I thought we had stopped making special lower-case constructors in favor of actual type constructors (so long as the constructor makes sense, which I believe it does in this case).
If the interface will be uppercase types, why not have IsEqual and In? Having the same name as the corresponding function would help discoverability and readability.

PS I do love the direction this PR has taken us in. I'm thinking using and reading stuff like getting all the indices of the NaNs in a vector via findall(IsEqual(NaN), vector_of_floats) and getting all the points within a given sphere via filter(In(sphere), vector_of_points) is reasonably nice.

As it stands I don't really love occursin and I naively assumed equalto would use == not isequal.

nalimilan · 2018-02-04T18:27:07Z

These are constructors for EqualTo a OccursIn. I thought we had stopped making special lower-case constructors in favor of actual type constructors (so long as the constructor makes sense, which I believe it does in this case).

Do you have a reference? We still use lowercase constructors systematically in Iterators. I think the idea is that the type is essentially an implementation detail.

I kind of agree the names aren't great, for example we use the third person for occursin but we don't say equals. It was proposed to just use in(x), which would also work for isequal(x), but I was concerned about the consistency with other functions: where should we stop adding currying methods? We could also introduce a subset of #24990 with special cases for these two functions (which is needed for efficient dispatch), making _ in x create a OccursIn(x) object and isequal(_, x)/_ == x create an EqualTo(x) object.

stevengj · 2018-02-13T17:33:00Z

Rather than having special lowering for _ in x, maybe f(_,x) in #24990 could be modified to lower to some kind of UnderscoreFunction(f, ...) closure object with a default method that just calls f, but which you could overload to provide specific behaviors for particular functions f. A tricky thing would be to figure out the right encoding of the arguments etcetera in the type so that there is no performance penalty.

nalimilan added the domain:search & find The find* family of functions label Nov 20, 2017

nalimilan mentioned this pull request Nov 20, 2017

Unifying search & find functions #10593

Closed

ararslan requested a review from JeffBezanson November 21, 2017 01:15

nalimilan force-pushed the nl/search branch 2 times, most recently from 7e109ce to e4cc7b6 Compare November 21, 2017 18:09

JeffBezanson reviewed Nov 21, 2017

View reviewed changes

ararslan added domain:collections Data structures holding multiple items, e.g. sets kind:deprecation This change introduces or involves a deprecation labels Nov 21, 2017

Sacha0 mentioned this pull request Nov 22, 2017

Find Julep: issue with sentinel values JuliaLang/Juleps#47

Closed

nalimilan force-pushed the nl/search branch from e4cc7b6 to 2cbca57 Compare November 22, 2017 22:35

nalimilan force-pushed the nl/search branch 3 times, most recently from 22d1f89 to 34751ed Compare November 23, 2017 15:43

nalimilan mentioned this pull request Nov 24, 2017

Remove bullet mentioning removal of findin JuliaLang/Juleps#34

Merged

nalimilan force-pushed the nl/search branch from 34751ed to 79c492d Compare December 2, 2017 15:34

nalimilan mentioned this pull request Dec 7, 2017

Deprecate findin(a, b) in favor or find(occursin(b), a) #24967

Closed

nalimilan force-pushed the nl/search branch from 79c492d to d6b213d Compare December 15, 2017 22:10

juliohm mentioned this pull request Jan 3, 2018

Comparison operator in findmin() variants #14216

Open

nalimilan force-pushed the nl/search branch 3 times, most recently from e4a3a06 to 8610b63 Compare January 4, 2018 14:07

nalimilan added 4 commits January 4, 2018 16:52

Merge search into findfirst/findnext and rsearch into findlast/findprev

93eef2d

Deprecate (r)searchindex(...) in favor of first(findnext/findprev(...))

d5f74cd

Deprecate ismatch(r, s) in favor of contains(s, r)

6ca43fc

Deprecate findin(a, b) in favor of find(occursin(b), a)

ff8b17d

nalimilan force-pushed the nl/search branch from 8610b63 to ff8b17d Compare January 4, 2018 15:53

JeffBezanson approved these changes Jan 4, 2018

View reviewed changes

nalimilan merged commit 1950086 into master Jan 4, 2018

nalimilan deleted the nl/search branch January 4, 2018 19:38

nalimilan mentioned this pull request Jan 5, 2018

Rename searchsorted* functions to findsorted* #25414

Closed

KristofferC mentioned this pull request Jan 8, 2018

Combine contains and ismatch? #19250

Closed

willow-ahrens mentioned this pull request Jan 8, 2018

Pja/fix julia home problems JuliaInterop/Cxx.jl#368

Closed

nalimilan mentioned this pull request Feb 11, 2018

find* functions and pairs()/keys() #25999

Closed

nalimilan mentioned this pull request Mar 7, 2018

Deprecate contains to occursin, deprecate callable regexes #26283

Merged

vchuravy mentioned this pull request Mar 15, 2018

increase MAX_INTERP_STATE_SIZE on armv7-l to 48 #26477

Merged

cormullion mentioned this pull request Mar 18, 2018

NEWS.md is getting a bit untidy #26508

Closed

jmkuhn mentioned this pull request Jun 19, 2018

Rename ismatch to occursin; fix 0.7 depwarns vtjnash/Glob.jl#15

Merged

sjoelund mentioned this pull request Aug 17, 2018

Changed examples to match tuples due to expression change, updated so… JuliaServices/Match.jl#48

Closed

nalimilan mentioned this pull request Sep 11, 2018

findall(::Fix2{typeof(in)}, array) and _findin make assumptions on finiteness #29124

Open

mbauman mentioned this pull request Dec 14, 2018

Rename occursin to something more reasonable #30368

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up search and find API #24673

Clean up search and find API #24673

nalimilan commented Nov 20, 2017

JeffBezanson Nov 21, 2017 •

edited

Loading

JeffBezanson Nov 21, 2017

nalimilan Nov 21, 2017

nalimilan Nov 21, 2017

JeffBezanson Nov 21, 2017

JeffBezanson commented Nov 21, 2017

nalimilan commented Nov 21, 2017

StefanKarpinski commented Nov 22, 2017

nalimilan commented Nov 22, 2017 •

edited

Loading

nalimilan commented Nov 22, 2017

nalimilan commented Dec 2, 2017

timholy commented Dec 3, 2017

nalimilan commented Dec 3, 2017

timholy commented Dec 3, 2017

kmsquire commented Dec 11, 2017

nalimilan commented Dec 11, 2017 •

edited

Loading

c42f commented Jan 1, 2018 •

edited

Loading

StefanKarpinski commented Jan 4, 2018

stevengj commented Feb 3, 2018 •

edited

Loading

nalimilan commented Feb 3, 2018

andyferris commented Feb 4, 2018

nalimilan commented Feb 4, 2018

stevengj commented Feb 13, 2018 •

edited

Loading

Clean up search and find API #24673

Clean up search and find API #24673

Conversation

nalimilan commented Nov 20, 2017

JeffBezanson Nov 21, 2017 • edited Loading

Choose a reason for hiding this comment

JeffBezanson Nov 21, 2017

Choose a reason for hiding this comment

nalimilan Nov 21, 2017

Choose a reason for hiding this comment

nalimilan Nov 21, 2017

Choose a reason for hiding this comment

JeffBezanson Nov 21, 2017

Choose a reason for hiding this comment

JeffBezanson commented Nov 21, 2017

nalimilan commented Nov 21, 2017

StefanKarpinski commented Nov 22, 2017

nalimilan commented Nov 22, 2017 • edited Loading

nalimilan commented Nov 22, 2017

nalimilan commented Dec 2, 2017

timholy commented Dec 3, 2017

nalimilan commented Dec 3, 2017

timholy commented Dec 3, 2017

kmsquire commented Dec 11, 2017

nalimilan commented Dec 11, 2017 • edited Loading

c42f commented Jan 1, 2018 • edited Loading

StefanKarpinski commented Jan 4, 2018

stevengj commented Feb 3, 2018 • edited Loading

nalimilan commented Feb 3, 2018

andyferris commented Feb 4, 2018

nalimilan commented Feb 4, 2018

stevengj commented Feb 13, 2018 • edited Loading

JeffBezanson Nov 21, 2017 •

edited

Loading

nalimilan commented Nov 22, 2017 •

edited

Loading

nalimilan commented Dec 11, 2017 •

edited

Loading

c42f commented Jan 1, 2018 •

edited

Loading

stevengj commented Feb 3, 2018 •

edited

Loading

stevengj commented Feb 13, 2018 •

edited

Loading