Unifying search & find functions #10593

nalimilan · 2015-03-20T23:16:12Z

Currently there are three families of search & find functions:

find findn findin findnz findmin findmax findfirst findlast findprev findnext
[r]search [r]searchindex searchsorted searchsortedlast searchsortedfirst
indexin

In the find family, find and findn return the indexes of non-zero or true values. findfirst, findlast, findprev and findnext are very similar to find, but work iteratively, and they additionally allow looking for an element in a collection (the latter behavior being similar to findin). The findmin and findmax functions are different, as they return both the value and the index of the min/max. Finally, findnz differs even more, as it only works on matrices and returns a tuple of vectors (I,J,V) holding the row index, column index and value.
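For readers on current Julia, here is a minimal sketch of how this family looks after the cleanup tracked in this issue. It assumes Julia 1.x, where find became findall and findnz lives in the SparseArrays standard library; the sample data and variable names are made up for illustration.

```julia
using SparseArrays  # standard library; findnz was moved here

x = [0, 3, 0, 7]

# Indices of nonzero elements (what one-argument find used to return).
nonzero_idx = findall(!iszero, x)      # [2, 4]

# First match, last match, and the next match starting at index 3.
first_idx = findfirst(!iszero, x)      # 2
last_idx  = findlast(!iszero, x)       # 4
next_idx  = findnext(!iszero, x, 3)    # 4

# findmax returns (value, index), unlike the other find* functions.
maxval, maxidx = findmax(x)            # (7, 4)

# findnz returns row indices, column indices and values of a sparse matrix.
S = sparse([1, 2], [1, 2], [5, 6])
rows, cols, vals = findnz(S)
```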

In the search family, [r]search and [r]searchindex look for strings/chars/regex in a string (though they also support bytes), the former returning a range, the latter the first index. searchsorted, searchsortedlast and searchsortedfirst look for values equal to or lower than an argument, and return a range for the first, and index for the two others.
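A short illustration of the range-versus-index distinction described above, assuming Julia 1.x. The searchsorted* functions kept their names; for substrings, findfirst now plays the role of search and returns a range, so first(...) recovers what searchindex returned. The sample values are arbitrary.

```julia
v = [1, 2, 4, 5, 5, 7]   # the input must already be sorted

match_range = searchsorted(v, 5)        # 4:5, all positions equal to 5
lo_idx      = searchsortedfirst(v, 5)   # 4, first index with value >= 5
hi_idx      = searchsortedlast(v, 5)    # 5, last index with value <= 5
empty_range = searchsorted(v, 3)        # 3:2, empty range at the insertion point

# Substring search: a range, and its first index.
rng = findfirst("lo", "hello")          # 4:5
idx = first(rng)                        # 4
```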

indexin is the same as findin (i.e. returns index of elements in a collection), but it returns 0 for elements that were not found, instead of a shorter vector.
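The difference is easier to see side by side. This sketch assumes Julia 1.x, where findin(a, b) is spelled findall(in(b), a) and indexin marks missing elements with nothing rather than 0. Note that the two calls answer slightly different questions: the first gives positions in a, the second gives positions in b.

```julia
a = ['a', 'b', 'c', 'b', 'd', 'a']
b = ['a', 'b', 'c']

# Positions in `a` whose element occurs somewhere in `b` (old findin(a, b)).
positions_in_a = findall(in(b), a)      # [1, 2, 3, 4, 6]

# For each element of `a`, its position in `b`, or `nothing` if absent.
positions_in_b = indexin(a, b)          # [1, 2, 3, 2, nothing, 1]
```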

I hope that summary is exact. Please correct me if not.

Questions/ideas:

Couldn't findin be renamed to find, as the signatures do not conflict? That would mean findfirst, findlast, findprev and findnext would just be iterating versions of find. Currently find offers fewer methods than the others.
That way, indexin could be renamed to findin to reunite the family (or add an argument to switch behaviors?)
What justifies the difference in vocabulary between find and search? I suggest we rename all search functions to find*: searchsorted* would become findsorted*, searchindex would be merged with findfirst, rsearchindex with findlast.
search could be renamed to findfirstrange, and rsearch to findlastrange, making them find any sequence of values in any collection, and not only in strings; if not, nicer names could be findstr and findrstr.
That way, you can easily get a list of interesting functions by typing find[tab][tab].
Maybe the series findfirst, findlast, findprev and findnext could be replaced/supplemented with an iterator eachfind/findeach?

nalimilan · 2015-03-20T23:33:20Z

Ah, just found #5664. Keeping this issue open because of the long summary above.

There's also #7327, which is about finding maximum and minimum. It could be more logical to change findmax and findmin to return the linear index (what indmax and indmin currently do), in order to be consistent with other find* functions. Just an idea.

See the spreadsheet at https://docs.google.com/spreadsheets/d/1ZLnlYQyRIWa50-mxOmKHNCaJEzGShLVSEkfid2KA-Ps/edit#gid=0

nalimilan · 2015-04-06T16:16:03Z

Any comments?

ViralBShah · 2015-04-06T17:41:15Z

+1 to the general unification of these APIs. If there is general consensus, we should also do it immediately.

mbauman · 2015-04-06T17:43:30Z

I'm a little disappointed this hasn't garnered any attention. I'd love to see these functions cleaned up, too… I often have difficulty remembering which function to use if it's been a little while. I'm afraid that nobody has dared comment simply because the scope here is potentially huge.

Some thoughts:

I really like moving findnext and friends into a findeach iterator. It'd be a net loss of functionality as you couldn't change the predicate of the iterator mid-stream… but that use-case is pretty limited.
I think ideally the sorted methods should be moved into dispatch on a SortedVector type. There's been talk of this before but I don't think there's an open issue for it.
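To make the iterator idea concrete, here is a minimal sketch of what a findeach could look like. The name comes from the discussion, but this definition is hypothetical and not part of Base; it assumes Julia 1.x and simply wraps pairs in a lazy generator.

```julia
# Hypothetical helper, not in Base: lazily yield the index of each match.
findeach(pred, A) = (i for (i, x) in pairs(A) if pred(x))

A = [1, 4, 6, 7, 10]

# Matches are produced one at a time, so a loop can stop early.
for i in findeach(iseven, A)
    println("even value ", A[i], " at index ", i)
end

all_even   = collect(findeach(iseven, A))   # [2, 3, 5]
first_even = first(findeach(iseven, A))     # 2
```

As noted above, the predicate is fixed when the iterator is created, so it cannot be changed mid-stream.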

In general, though, I think we should start from scratch and define what we want the find name to mean as @JeffBezanson suggests in #5664 (comment). In that context, I think there are a few key questions:

Do we want to support find by value? I think everyone will agree that find(A) and find(predicate, A) return indices of nonzero elements and where predicate(elt) returns true, respectively. The next/prev versions also support a findnext(A, v, i) method to find the next value v after index i. Similarly, findin(a, b) returns indexes in a that match a value in b. I think we should only fold findin→find if the general form supports find by value. In practice, most folks simply use find(A.==v) for this purpose, but that doesn't compose well with a findeach iterated method.

As a practical matter, we simply cannot fold in all these behaviors to one name, so perhaps that is the significant difference between search and find?

find(A) # Indices of nonzero elements
find(predicate, A) # would prefer to duck-type predicate and just call it
find(A, values) # like findin; would prefer to duck-type values and just iterate over them
find(A, value) # somewhat like search (but all at once); just want to check equality

Do we want search and find to be consistent in operating iteratively or all-at-once? Currently, I think some of the confusion for me comes from the fact that the two names have very similar purposes but different interfaces.
How do we detangle substring search from searching for a bundle of characters? Again, this is a crucial difference in how search and findnext operate, and is generalizable (albeit perhaps not as useful) beyond strings.
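For comparison, this is how the "find by value" cases can be written in Julia 1.x, where the value is wrapped in a predicate such as ==(v) or in(values) rather than passed as a separate argument. It is a sketch of current behavior, not of any API proposed in this comment.

```julia
A = [3, 1, 3, 2, 1]

by_value     = findall(==(3), A)        # [1, 3], equality with a single value
by_values    = findall(in((1, 2)), A)   # [2, 4, 5], membership in a collection
next_by_val  = findnext(==(1), A, 3)    # 5, next occurrence of 1 from index 3
by_broadcast = findall(A .== 3)         # [1, 3], allocates a temporary Bool array
```

The predicate forms compose with findnext and findprev, which the broadcast form does not.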

mbauman · 2015-04-06T19:14:59Z

In terms of orthogonal design elements, we currently have a jumbled mishmash of combinations of:

Return indices of: non-zeros, predicate-test-true, elements in collection, elements equal to value, extrema, or a range of elements matching a sequence.
Operate: Iteratively forwards (or the first), iteratively backwards (or the last), or all-at-once

nalimilan · 2015-04-06T19:29:00Z

Thanks, that's a useful way of summarizing the requirements.

mbauman · 2015-04-06T20:36:42Z

It took me a while to get there, but I think that's the most useful way to think about this. Then the question is simply what names we want to give to those capabilities. Here's one terribly disruptive possibility:

| | nonzeros | predicate test | in collection `c` | equal to value `v` | sequence or regex `s` |
|---|---|---|---|---|---|
| Iteratively forwards | `search(A)` | `search(pred,A)` | `searchin(A,c)` | `searcheq(A,v)` | `searchseq(A,s)` |
| Iteratively backwards | `rsearch(A)` | `rsearch(pred,A)` | `rsearchin(A,c)` | `rsearcheq(A,v)` | `rsearchseq(A,s)` |
| All at once | `find(A)` | `find(pred,A)` | `findin(A,c)` | `findeq(A,v)` | `findseq(A,s)` |

The search and rsearch methods would take an optional integer argument for the start index. I don't think we can tuck iterative searching entirely into an iterator object since you often want to start at a known index (and not just some internal iterator state). We could also add methods to return an iterator, but I don't have a good name for that — findeach becomes a bit clumsy with -in, -eq and -seq suffixes.
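The start-index style of iteration can be sketched with the functions that exist in Julia 1.x, where findnext takes the index to resume from and returns nothing when there are no more matches. The helper name positions_of is made up for this example.

```julia
# Hypothetical helper: collect every match by repeatedly resuming the search.
function positions_of(pred, A::AbstractVector)
    out = Int[]
    i = findfirst(pred, A)
    while i !== nothing
        push!(out, i)
        # Resume just past the previous match; a start index beyond the
        # end of A makes findnext return nothing.
        i = findnext(pred, A, i + 1)
    end
    return out
end

A = [1, 2, 3, 4, 6]
evens       = positions_of(iseven, A)   # [2, 4, 5]
from_middle = findnext(iseven, A, 3)    # 4, starting at a known index
```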

(I don't really like this because it gives a pretty limited meaning to the very nice name search, but it's just a spitball to get the ball rolling.)

nalimilan · 2015-04-06T21:03:00Z

Nice table. We may be able to merge searcheq/findeq into search/find. For such a basic usage, using the most basic function makes sense.

But I'm not a fan of the search vs. find naming convention: I don't find it very obvious, and the fact that there's no common prefix makes it harder to find using ? or tab completion. Why not findf (f for forward) and findr? Or even findfirst and findlast?

hayd · 2015-04-06T21:05:14Z

> Or even findfirst and findlast?

or findnext and findprev.

mbauman · 2015-04-06T21:09:25Z

I went with this because I find combinations of more than two words pretty unreadable, and I went with suffixes for the kind of searching operation. Using findnext and findprev would result in names like findnextin or findprevseq.

I think we could only unify the -eq functions if we restrict the predicates to be ::Function or ::Union(DataType, Function) (which is currently called Base.Callable, but is terribly misleading since anything can be callable these days).
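The dispatch constraint can be illustrated with a small sketch. The function findvals and the type Above are hypothetical names invented here; the point is only that a predicate method and a by-value method can share one name when the predicate is restricted to ::Function. Written for Julia 1.x.

```julia
# Hypothetical unified function: predicate form and by-value form.
findvals(pred::Function, A) = findall(pred, A)
findvals(A, v) = findall(==(v), A)

# A callable struct is only treated as a predicate if it subtypes Function;
# otherwise the call would fall through to the by-value method.
struct Above <: Function
    threshold::Int
end
(a::Above)(x) = x > a.threshold

by_pred     = findvals(iseven, [1, 2, 4])     # [2, 3]
by_value    = findvals([1, 2, 1], 1)          # [1, 3]
by_callable = findvals(Above(2), [1, 3, 5])   # [2, 3]
```

This is the trade-off mentioned above: dispatching on Function is simple, but any callable object that does not subtype Function is silently treated as data.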

mbauman · 2015-04-06T21:23:11Z

We could call them all find*:

find(A, start::Int, rev::Bool=false) # or dir::Order.Ordering=Order.Forward

Edit: this needs more explanation: without the start index, it's an all-at-once operation:

| | nonzeros | predicate test | in collection `c` | equal to value `v` | sequence or regex `s` |
|---|---|---|---|---|---|
| Iteratively forwards | `find(A,1)` | `find(pred,A,1)` | `findin(A,c,1)` | `findeq(A,v,1)` | `findseq(A,s,1)` |
| Iteratively backwards | `find(A, endof(A),true)` | `find(pred,A, endof(A),true)` | `findin(A,c, endof(A),true)` | `findeq(A,v, endof(A),true)` | `findseq(A,s, endof(A),true)` |
| All at once | `find(A)` | `find(pred,A)` | `findin(A,c)` | `findeq(A,v)` | `findseq(A,s)` |

pao · 2015-04-06T21:51:50Z

> dir::Order.Ordering=Order.Forward

Way clearer than a Boolean. A Boolean would be something I'd need to check the docs for every time I encountered it.

mbauman · 2015-04-06T22:29:08Z

That's what I had initially, but Base.Order and Order.Forward are both unexported… and so I changed to match the sorting API. That's a minor issue, though. I'm not sure I like having both iterative and all-at-once behaviors under the same name.

kmsquire · 2015-04-07T02:01:29Z

@mbauman, I really like your distinction of operation versus how to operate ("iteratively" vs "all-at-once").

I would suggest that the "all-at-once" operations are actually closer to filtering, though, and these seem like different enough concepts from the "iterative" operations to warrant a distinct name.

I actually kind of liked your first search vs find suggestion. I'm not a fan of the current search vs find situation--it is a mishmash--but if there was actually some rationale to the distinction, I would (probably) be fine with it, and the one you suggested seems reasonable to me.

(Also: referring to the table for find above, to me, find(A,1) suggests reducing along dimension 1--a Matlab-ism, to be sure, but still one present in a number of Julia functions--sum, prod, max, etc.--I think that would be inconsistent and somewhat confusing.)

kmsquire · 2015-04-07T02:02:38Z

Some relevant reading:

https://stackoverflow.com/questions/480811/semantic-difference-between-find-and-search
https://english.stackexchange.com/questions/21020/difference-between-find-and-search

nalimilan · 2015-04-07T07:56:04Z

@kmsquire As I understand it, the outcome of the threads you link to is that there's no clear distinction between "find" and "search", except that the latter insists on the process and may not return anything (but in our case, find may not return anything either...) Stretching that idea a bit, one could decide that find means "get all matches" and search means "return the first occurrence after a given index, possibly going in reverse direction, so that I can write a more complex search loop". But I'm not sure that's really obvious.

Otherwise, I find that the idea of merging forward/reverse search is appealing, but it can get quite confusing as the start index is not in the same position in find(A, 1) and find(pred, A, 1). Maybe the search order should always be specified before that, to make the distinction between the two series of functions clearer: find(A, Order.Forward, 1) and find(pred, A, Order.Forward, 1). That way, the index would be optional when you only want to get the first (or last for Order.Reverse) result.
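A rough sketch of the "order before index" signature, written against Julia 1.x. The function findfrom is a hypothetical name, and the direction is dispatched on the types of Base.Order.Forward and Base.Order.Reverse, which are unexported; this is one possible reading of the suggestion rather than an API that was adopted.

```julia
# Aliases for the unexported ordering types.
const FwdOrd = Base.Order.ForwardOrdering
const RevOrd = Base.Order.ReverseOrdering

# Hypothetical function: the order comes before the optional start index,
# so the index defaults to the first or last position depending on direction.
findfrom(pred, A, ::FwdOrd, i=firstindex(A)) = findnext(pred, A, i)
findfrom(pred, A, ::RevOrd, i=lastindex(A))  = findprev(pred, A, i)

A = [1, 2, 3, 4, 5]
first_even  = findfrom(iseven, A, Base.Order.Forward)      # 2
last_even   = findfrom(iseven, A, Base.Order.Reverse)      # 4
next_even   = findfrom(iseven, A, Base.Order.Forward, 3)   # 4
prev_even   = findfrom(iseven, A, Base.Order.Reverse, 3)   # 2
```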

toivoh · 2015-04-07T15:23:13Z

I think the distinction at least makes some sense if you describe the two cases as eg "Find all matches" and "Search for the first match after the given index" where the first one will indeed always find all matches (though they may be zero) but the other one might not find the next match if there is none.

nalimilan · 2016-02-28T19:26:15Z

How about this plan, in which everything would be called find*.

The short find* versions return all matches.
When Order.Forward or Order.Backward is added as an argument, they return an iterator.
When a starting index is added as well, they return the first result after that index (or before it when searching backwards).

This is essentially @mbauman's last table from #10593 (comment), except that the boolean is replaced with Order.Forward or Order.Backward, and that another row is added for getting an iterator.

I can have a look at a PR if you agree.

These seem unrelated, but they're actually linked:
* If you reverse generic strings by wrapping them in `RevString` then this generic `reverseind` is incorrect.
* In order to have a correct generic `reverseind` one needs to assume that `reverse(s)` returns a string of the same type and encoding as `s` with code points in reverse order; one also needs to assume that the code units encoding each character remain the same when reversed. This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've fixed them well enough to work but they may be quite inefficient for long strings now. I'm not going to spend too much time on this since there's other work going on to generalize and unify searching APIs. Close #22611 Close #24613 See also: #10593 #23612 #24103

nalimilan · 2017-12-07T17:27:36Z

Status update which covers everything in the Proposal 3 of the Search and Find Julep and in related discussion points:

PR Clean up search and find API #24673 deals with search and rsearch. The only question that remains to be decided is whether it's OK for findnext(::String, ::String) and similar to return a range. searchindex and rsearchindex are deprecated in favor of first(findnext(...)) and first(findprev(...)).
PR Clean up search and find API #24673 also merges ismatch into contains. match and eachmatch can be kept as-is, since they return a special RegexMatch object rather than indices (see also issue Combine contains and ismatch? #19250 and previous PR Add method to contains for regex needle #18028).
PR Clean up search and find API #24673 also deprecates findin(a, b) in favor of find(occursin(b), a). See also issue Deprecate findin(a, b) in favor or find(occursin(b), a) #24967.
PR Change find() to return the same index type as pairs() #24774 changes find to return the same index type as pairs/keys and makes it work with any collection. It's surprisingly simple, though it's waiting for a Nanosoldier run (which may prompt some adjustments to the AbstractArray{Bool} method).
PR Change findfirst/findlast/findnext/findprev to return the same index type as keys() #25577 changes findfirst, findlast, findnext and findprev to accept/return the same type of index as find does since PR Change find() to return the same index type as pairs() #24774.
PR Change sentinel in find(first|next|prev|last) to nothing #25472 changed findnext, findprev, findfirst and findlast to return nothing rather than 0 when there is no match, in order to support arrays with custom indices and arbitrary collections. See also Find Julep: issue with sentinel values Juleps#47.
PR Rename find() to findall() #25545 renames find to findall so that its purpose is more explicit.
PR Deprecate findn(x) in favor of find(!iszero, x), which now returns cartesian indices #25532 deprecates findn(x) in favor of find(!iszero, x) now that find returns cartesian indices.
PR Move findnz to SparseArrays module #25641 moves findnz to the new SparseArrays stdlib module (see also issue Renaming findn and findnz? #24910).
PR Change findfirst and findlast to return cartesian indices with HasShape iterators #25655 makes findfirst and findlast return cartesian indices for HasShape iterators, for consistency with arrays.
Issue Rename findmin and findmax? #24865 discusses renaming findmin and findmax so that their names are more consistent with find* functions returning indices rather than (index, value) tuples.
The findeach function proposed by @JeffBezanson in the Julep can be introduced after 1.0, and findnext/findprev/findfirst/findlast be made simple convenience functions around it.
PR Rename searchsorted* functions to findsorted* #25414 renames searchsorted* functions to findsorted* and reorders their arguments for consistency with other find* functions (see also issue What to do with searchsorted* functions #24883). EDIT: triage decided not to do this.
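Several items in this list change return values in ways that are easy to miss. The following sketch shows the resulting behavior as it stands in Julia 1.x (cartesian indices for matrices, nothing as the no-match sentinel, ranges for substring matches, and occursin for containment tests). The sample data is arbitrary.

```julia
M = [0 3; 5 0]

# findall and findfirst return cartesian indices for matrices,
# visiting elements in column-major order.
all_nz   = findall(!iszero, M)     # [CartesianIndex(2, 1), CartesianIndex(1, 2)]
first_nz = findfirst(!iszero, M)   # CartesianIndex(2, 1)

# No match yields `nothing` instead of 0, so callers must check for it.
hit = findfirst(==(7), [1, 2, 3])
msg = hit === nothing ? "no match" : "match at $hit"

# Substring search returns a range; occursin covers contains/ismatch.
rng       = findfirst("lo", "hello")   # 4:5
has_sub   = occursin("lo", "hello")    # true
has_regex = occursin(r"l+", "hello")   # true
```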

nalimilan · 2017-12-31T15:05:43Z

Even if the API implemented in the above PRs is much more consistent than the previous one, I wonder whether a few more changes wouldn't make it even better. I'm now tempted to say that an ideal design would involve renaming find to findall (which is more explicit), and replacing findfirst(needle, haystack) with find(needle, haystack), findnext(needle, haystack, i) with find(needle, haystack, i), findlast(needle, haystack) with find(needle, haystack; rev=true), and findprev(needle, haystack, i) with find(needle, haystack, i; rev=true).

The two potential issues with this proposal are

The find change cannot happen in a single deprecation cycle, at least for the two-argument version. But we could require using the three-argument form in 0.7, given that you just need to pass the first/last index of the haystack explicitly.
Keyword arguments need to be fast enough for rev. If not, we can still make it a positional argument.
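For concreteness, the proposed single-function design could be sketched as follows on top of the existing Julia 1.x functions. The name findone is hypothetical (chosen so as not to clash with Base), and only the predicate form is covered; this is an illustration of the idea, not what was merged.

```julia
# Hypothetical wrapper: one function, optional start index, rev keyword.
function findone(pred, A, i=nothing; rev::Bool=false)
    # Default start: last index when searching backwards, first otherwise.
    start = something(i, rev ? lastindex(A) : firstindex(A))
    return rev ? findprev(pred, A, start) : findnext(pred, A, start)
end

A = [1, 2, 3, 4, 5]
like_findfirst = findone(iseven, A)                # 2
like_findnext  = findone(iseven, A, 3)             # 4
like_findlast  = findone(iseven, A; rev=true)      # 4
like_findprev  = findone(iseven, A, 3; rev=true)   # 2
```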

Opinions?

JeffBezanson · 2017-12-31T15:59:24Z

I like the idea of renaming find to findall, but I also like the clarity of findfirst.

Sacha0 · 2017-12-31T17:03:33Z

findall is beautifully descriptive :).

StefanKarpinski · 2017-12-31T18:11:09Z

I think the set findall, findfirst, findnext, findlast and findprev is nicely explicit. Sure, there's some overlap but it kind of mirrors what we now have for string indices.

JeffBezanson · 2018-01-04T22:48:27Z

#24673 closes this as far as I'm concerned.

nalimilan · 2018-01-05T10:40:36Z

@JeffBezanson I don't think so, see the check list in my comment above. In particular there's still #24774 and JuliaLang/Juleps#47.

We can also rename find to findall since that proposal got a lot of support above (but that's really easy).

timholy · 2018-02-05T22:24:56Z

Just wanted to say that despite the fact that I barely interacted with @nalimilan's work here, the new names seem so compelling that when I recently had a 0.6-based project, I could barely remember the old way of doing this stuff and had to look at the docs several times ("right, search, that's what I was looking for..."). A clear sign of success.

sarah-ji · 2018-04-20T20:00:36Z

Hi everyone! I'm not sure if anyone will see this but I am having some trouble finding the indices from a user inputted string.

I have a string of ID's I am interested in finding:
ppl_id = ["T2DG0200031", "T2DG0200032", "T2DG0200033"]

and want to search the second column of a matrix which has all ID's ( of type Array{AbstractString,2}) for the indices that match the strings in ppl_id.

Any ideas...? It seems I can't get this to work with find() or findeach() or what you guys have mentioned above..
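One possible approach with the functions discussed in this thread, assuming Julia 1.x. The matrix `data` and its contents are invented stand-ins for the real table, and find()/findeach() as named in the question do not exist in 1.x (find became findall, and findeach was never added to Base).

```julia
ppl_id = ["T2DG0200031", "T2DG0200032", "T2DG0200033"]

# Hypothetical stand-in for the real matrix; IDs are in the second column.
data = AbstractString["a" "T2DG0200032";
                      "b" "X0000000000";
                      "c" "T2DG0200031"]

ids = data[:, 2]

# Rows of `data` whose ID appears in ppl_id.
matching_rows = findall(in(ppl_id), ids)   # [1, 3]

# For each row, the position of its ID in ppl_id, or nothing if absent.
position_in_ppl = indexin(ids, ppl_id)     # [2, nothing, 1]
```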

fredrikekre · 2018-04-20T20:21:03Z

Questions are better suited for the forum: https://discourse.julialang.org/

ihnorton added the needs decision A decision on this change is needed label Mar 28, 2015

nalimilan mentioned this issue Jun 9, 2015

findin does not work for strings? #11630

Closed

nalimilan mentioned this issue Jun 16, 2015

Consistent argument order, Vectorization #11722

Closed

This was referenced Apr 1, 2016

invalid character index in findfirst #15723

Closed

findfirst and findnext for general iterables #15755

Closed

StefanKarpinski added the design Design of APIs or of the language itself label Sep 13, 2016

StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016

StefanKarpinski removed the needs decision A decision on this change is needed label Sep 13, 2016

JeffBezanson mentioned this issue Oct 17, 2016

rename search and find functions? #5664

Closed

This was referenced Nov 2, 2016

Inconsistent order of arguments in variants of 'find' #19186

Closed

Combine contains and ismatch? #19250

Closed

miakramer mentioned this issue Nov 13, 2016

Inconsistent Argument Order #19314

Closed

nalimilan mentioned this issue Dec 2, 2017

What to do with searchsorted* functions #24883

Closed

nalimilan mentioned this issue Dec 7, 2017

Deprecate findin(a, b) in favor or find(occursin(b), a) #24967

Closed

StefanKarpinski mentioned this issue Dec 14, 2017

Rename findmin and findmax? #24865

Closed

nalimilan added the status:triage This should be discussed on a triage call label Dec 21, 2017

JeffBezanson closed this as completed Jan 4, 2018

JeffBezanson removed the status:triage This should be discussed on a triage call label Jan 4, 2018

nalimilan reopened this Jan 5, 2018

nalimilan mentioned this issue Jan 5, 2018

Rename searchsorted* functions to findsorted* #25414

Closed

This was referenced Jan 12, 2018

Deprecate findn(x) in favor of find(!iszero, x), which now returns cartesian indices #25532

Merged

Rename find() to findall() #25545

Merged

Change findfirst/findlast/findnext/findprev to return the same index type as keys() #25577

Merged

JeffBezanson closed this as completed Jan 18, 2018

This was referenced Jan 19, 2018

Move findnz to SparseArrays module #25641

Merged

Change findfirst and findlast to return cartesian indices with HasShape iterators #25655

Merged

cormullion mentioned this issue Mar 18, 2018

NEWS.md is getting a bit untidy #26508

Closed

nalimilan mentioned this issue Jun 5, 2023

Single naming convention for functions returning indices in 2.0 #50003

Open

nalimilan mentioned this issue Aug 13, 2023

findfirst dispatches on ::Function #49085

Open
