Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add vectorized "in" (.∈) and "notin" (.∉) #12406

Closed
wants to merge 1 commit into from

Conversation

alyst
Copy link
Contributor

@alyst alyst commented Jul 31, 2015

Add vectorized "in" (.∈) and "not in" (.∉) operators to be on par with R.
This is quite handy in combination with DataFrames for dataset filtering.

@KristofferC
Copy link
Sponsor Member

When you benchmark it is better to put code in a function. It avoids spurious results from evaluations in global scope. Compare:

x = collect(1:1000);
@time for i in 1:1000 Bool[(xx  1:100) for xx in x] end
# 0.306228 seconds (5.96 M allocations: 152.985 MB, 6.32% gc time)
function f() 
    x = collect(1:1000)
    for i in 1:1000 Bool[(xx  1:100) for xx in x] end
end
@time f()
# 0.002487 seconds (1.01 k allocations: 1.045 MB)

@alyst
Copy link
Contributor Author

alyst commented Jul 31, 2015

@KristofferC Wow, that made a huge difference!

function bench_comprehension(n::Int) 
    x = collect(1:1000);
    for i in 1:n Bool[(xx  1:100) for xx in x]  end
end
@time bench_comprehension(10^6)
# 2.858 seconds      (1000 k allocations: 1038 MB, 7.97% gc time)
function bench_functor(n::Int) 
    x = collect(1:1000);
    for i in 1:n x .∈ 1:100 end
end
@time bench_functor(10^6)
# 3.460 seconds      (1000 k allocations: 1038 MB, 6.39% gc time)

@@ -380,6 +380,10 @@ const ∈ = in
∋(itr, x)= ∈(x, itr)
∌(itr, x)=!∋(itr, x)

# vectorized ∈ and ∉
(.∈){T}(x::AbstractArray{T}, set) = [xx ∈ set for xx in x]
(.∉){T}(x::AbstractArray{T}, set) = [xx ∉ set for xx in x]
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should produce BitArrays; this should also do broadcasting the way other vectorized ops do.

@alyst
Copy link
Contributor Author

alyst commented Aug 10, 2015

@StefanKarpinski It definitely makes sense, but I need some guidance to make it right.

Unlike the common broadcasting case (e.g. .==), where the operands of the "kernel" operator are of the same nature (and promotable to some common type), in() tests a single element against the collection.
At the moment PR implements only the case when each element of a vector x is tested against a single fixed collection set. As in(), .∈ doesn't declare the type of set (so that any collection that implements in() would be automatically supported). Full vectorization requires "single element vs vector of sets" and "vector of elements vs vector of sets" methods.

Ideally, one would like to have in{S,T}(x::S, y::AbstractCollection{T}) declaration, where S and T could be promoted to some common type and (.∈){S,T}(x::AbstractVector{S}, y::AbstractVector{AbstractCollection{T}}) etc. The first problem is that there's no AbstractCollection type. The second problem is that I don't know how to express that S and T are promotable.

Actually, even the proposed PR cannot discriminate "vector of elements vs single set" and "set vs single collection of sets" cases.

@JeffBezanson
Copy link
Sponsor Member

I'm not a big fan of this. It's a bit confusing to vectorize operations that are already collection operations, and I'd rather not encourage more special-case vector notation (#8450).

@alyst
Copy link
Contributor Author

alyst commented Aug 10, 2015

@JeffBezanson If #8450 would lead to some new syntax for an easy vectorization, it would be fantastic.
The current state is, sort of, ambivalent: the wide unicode palette is made available for writing concise code, but in practice it results in more obscure and redundant lines (some DataFrame-insipred example below):

d[:is_selected] = map(id -> id ∈ sel_ids, d[:id]) & map(t -> t ∈ sel_types, d[:type])

@alyst
Copy link
Contributor Author

alyst commented Sep 11, 2015

So the reaction to .∈ appears to be rather skeptical.
Would it be more appealing if the PR is reduced just to the changes in the parser, so that .∈ is treated as a valid operator syntax? Then it would still be possible to define .∈ in the package(s).

@nalimilan
Copy link
Member

See old issue filed at #5212.

@tkelman
Copy link
Contributor

tkelman commented May 11, 2016

closing since the infix is covered as a special case of #14544, the prefix function form works now with #15032.

@tkelman tkelman closed this May 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants