Add vectorized "in" (.∈) and "notin" (.∉) #12406

alyst · 2015-07-31T09:57:09Z

Add vectorized "in" (.∈) and "not in" (.∉) operators to be on par with R.
This is quite handy in combination with DataFrames for dataset filtering.

KristofferC · 2015-07-31T10:44:56Z

When you benchmark, it is better to put the code in a function; this avoids spurious results from evaluation in global scope. Compare:

x = collect(1:1000);
@time for i in 1:1000 Bool[(xx ∈ 1:100) for xx in x] end
# 0.306228 seconds (5.96 M allocations: 152.985 MB, 6.32% gc time)

function f() 
    x = collect(1:1000)
    for i in 1:1000 Bool[(xx ∈ 1:100) for xx in x] end
end
@time f()
# 0.002487 seconds (1.01 k allocations: 1.045 MB)

alyst · 2015-07-31T13:26:51Z

@KristofferC Wow, that made a huge difference!

function bench_comprehension(n::Int) 
    x = collect(1:1000);
    for i in 1:n Bool[(xx ∈ 1:100) for xx in x]  end
end
@time bench_comprehension(10^6)
# 2.858 seconds      (1000 k allocations: 1038 MB, 7.97% gc time)

function bench_functor(n::Int) 
    x = collect(1:1000);
    for i in 1:n x .∈ 1:100 end
end
@time bench_functor(10^6)
# 3.460 seconds      (1000 k allocations: 1038 MB, 6.39% gc time)

StefanKarpinski · 2015-07-31T16:37:31Z

base/reduce.jl

@@ -380,6 +380,10 @@ const ∈ = in
 ∋(itr, x)= ∈(x, itr)
 ∌(itr, x)=!∋(itr, x)

+# vectorized ∈ and ∉
+(.∈){T}(x::AbstractArray{T}, set) = [xx ∈ set for xx in x]
+(.∉){T}(x::AbstractArray{T}, set) = [xx ∉ set for xx in x]


These should produce BitArrays; this should also do broadcasting the way other vectorized ops do.
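For comparison, a minimal sketch of the behaviour requested here, written with the generic dot-broadcast syntax of later Julia releases (assumed: Julia 1.x) rather than with the PR's own methods. The names `x`, `set`, `mask` and `inverse` are illustrative only. Wrapping the collection in `Ref` is one way to tell broadcasting to treat it as a single value.

```julia
# Assumes Julia 1.x; this is not the code proposed in the PR.
x = collect(1:10)
set = 1:3

# Ref(set) shields `set` from broadcasting, so every element of x
# is tested against the whole collection.
mask = x .∈ Ref(set)
inverse = x .∉ Ref(set)

# Broadcasting a Bool-valued function yields a BitVector, not a Vector{Bool}.
println(typeof(mask))
```
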

alyst · 2015-08-10T12:37:20Z

@StefanKarpinski It definitely makes sense, but I need some guidance to get it right.

Unlike the common broadcasting case (e.g. .==), where the operands of the "kernel" operator are of the same nature (and promotable to some common type), in() tests a single element against the collection.
At the moment the PR implements only the case where each element of a vector x is tested against a single fixed collection set. Like in(), .∈ does not declare the type of set (so that any collection that implements in() is supported automatically). Full vectorization would also require "single element vs vector of sets" and "vector of elements vs vector of sets" methods.

Ideally, one would like to have a declaration in{S,T}(x::S, y::AbstractCollection{T}), where S and T can be promoted to some common type, plus (.∈){S,T}(x::AbstractVector{S}, y::AbstractVector{AbstractCollection{T}}), etc. The first problem is that there is no AbstractCollection type. The second problem is that I don't know how to express that S and T are promotable.

Actually, even the proposed PR cannot distinguish the "vector of elements vs single set" case from the "set vs single collection of sets" case.
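The three cases discussed above can be sketched as follows. This is only an illustration under the assumption of Julia 1.x broadcasting semantics, where the caller resolves the ambiguity explicitly by wrapping a collection in `Ref`; it is not what the PR implemented. All variable names are hypothetical.

```julia
# Assumes Julia 1.x broadcasting; names are illustrative.
elements = [1, 5, 9]
one_set  = Set([1, 2, 3])
sets     = [Set([1, 5]), Set([4]), Set([5, 9])]

# Vector of elements vs one fixed set: Ref marks the set as a single value.
many_vs_one = elements .∈ Ref(one_set)

# Single element vs vector of sets: the scalar is reused for every set.
one_vs_many = 5 .∈ sets

# Vector of elements vs vector of sets: the i-th element is tested
# against the i-th set.
pairwise = elements .∈ sets
```
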

JeffBezanson · 2015-08-10T16:45:28Z

I'm not a big fan of this. It's a bit confusing to vectorize operations that are already collection operations, and I'd rather not encourage more special-case vector notation (#8450).

alyst · 2015-08-10T21:36:11Z

@JeffBezanson If #8450 led to some new syntax for easy vectorization, that would be fantastic.
The current state is somewhat ambivalent: the wide Unicode palette is available for writing concise code, but in practice the result is more obscure and redundant lines (a DataFrame-inspired example below):

d[:is_selected] = map(id -> id ∈ sel_ids, d[:id]) & map(t -> t ∈ sel_types, d[:type])
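For reference, a hedged sketch of how that filter reads with the dot-broadcast syntax of later Julia releases (assumed: Julia 1.x). Plain vectors stand in for the DataFrame columns, and the data and names are made up for illustration.

```julia
# Assumes Julia 1.x; plain vectors replace the DataFrame columns d[:id] and d[:type].
ids       = [1, 2, 3, 4]
types     = [:a, :b, :a, :c]
sel_ids   = Set([1, 3, 4])
sel_types = Set([:a])

# The parentheses are needed because .& binds more tightly than .∈.
is_selected = (ids .∈ Ref(sel_ids)) .& (types .∈ Ref(sel_types))
```
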

alyst · 2015-09-11T13:47:15Z

So the reaction to .∈ appears to be rather skeptical.
Would it be more appealing if the PR were reduced to just the parser changes, so that .∈ is treated as valid operator syntax? Then it would still be possible to define .∈ in packages.

nalimilan · 2016-01-03T12:11:55Z

See old issue filed at #5212.

tkelman · 2016-05-11T18:01:21Z

Closing, since the infix form is covered as a special case of #14544 and the prefix function form now works with #15032.

add vectorized "in" (.∈) and "notin" (.∉)

e1d241d

alyst force-pushed the vectorized_in branch from f553520 to e1d241d Compare July 31, 2015 13:30

StefanKarpinski reviewed Jul 31, 2015

tkelman mentioned this pull request Jan 3, 2016

Allow users to define "dot" vectorized operators. #14544

Closed

tkelman closed this May 11, 2016
