
rand(::Set) #16231

Closed
stevengj opened this issue May 6, 2016 · 14 comments
Labels
good first issue (Indicates a good issue for first-time contributors to Julia), randomness (Random number generation and the Random stdlib)

Comments

@stevengj
Member

stevengj commented May 6, 2016

As mentioned on the mailing list, it might be nice to have a rand(::Set) function. Since this is hard to implement efficiently without access to the Set internals, it is something that almost has to be in Base. Here is a sample O(1) implementation:

import Base: GLOBAL_RNG, isslotfilled, rand
function rand(r::AbstractRNG, s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    n = length(s.dict.slots)
    while true
        # Rejection sampling: draw slots uniformly until a filled one is hit.
        i = rand(r, 1:n)
        isslotfilled(s.dict, i) && return s.dict.keys[i]
    end
end
rand(s::Set) = rand(GLOBAL_RNG, s)

If someone wants to put together a test case and documentation, it could make a good intro PR.

Probably also add a rand(::Set, dims) method (etcetera) to supplement rand(::Array, dims), which could be accomplished just by converting rand arguments from AbstractArray to Union{AbstractArray,AbstractSet} in a few places.
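For illustration, a hedged sketch of what such a dims method could do, written as a free function rather than the Union signature change (the helper name rand_set and its collect-based scalar draw are stand-ins, not the proposed Base implementation):

```julia
using Random

# Hypothetical helper: fill an array of the given dims with independent
# random elements of the set, one scalar draw per cell.
function rand_set(rng::AbstractRNG, s::Set, dims::Dims)
    elems = collect(s)  # stand-in for the O(1) slot-sampling scalar method
    [elems[rand(rng, 1:length(elems))] for _ in CartesianIndices(dims)]
end
rand_set(s::Set, dims::Integer...) = rand_set(Random.default_rng(), s, map(Int, dims))
```

With this, rand_set(Set([1,2,3]), 2, 3) returns a 2×3 matrix of elements drawn (with replacement) from the set.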

@stevengj stevengj added the good first issue Indicates a good issue for first-time contributors to Julia label May 6, 2016
@stevengj
Member Author

stevengj commented May 6, 2016

Note that my implementation above is efficient for the usual case that the set's internal hash table is at most a few times larger than the size of the set. You can defeat this by deleting the contents of the set and then re-inserting a much smaller number of elements. For example, my code will be extremely inefficient for:

s = Set(1:10^7)
empty!(s)
push!(s, 1)
length(s.dict.slots) / length(s) # gives 1.7e7!
@time rand(s)

Here the final rand(s) call alone takes almost 1 second.

I suppose you could test whether length(s.dict.slots) > somecoefficient * length(s)^2 and use an O(length(s)) algorithm in that case, where somecoefficient is determined by benchmarking the two algorithms and finding (roughly) the crossover point. However, even iteration over the set is very inefficient in this case [it's O(length(s.dict.slots)), not O(length(s)), I think], so there may be no good algorithm, and it may not be worth worrying about.
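The hybrid described above might be sketched like this. COEFF is a placeholder crossover constant that would be fixed by benchmarking, and the s.dict.slots / Base.isslotfilled accesses assume Dict internals that can change between Julia versions:

```julia
using Random

const COEFF = 8  # placeholder; the real value would come from benchmarking

function rand_hybrid(rng::AbstractRNG, s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    nslots = length(s.dict.slots)  # internal field; version-dependent
    if nslots > COEFF * length(s)^2
        # Pathologically sparse table: fall back to an O(length(s)) scan.
        i = rand(rng, 1:length(s))
        for (k, x) in enumerate(s)
            k == i && return x
        end
    end
    # Typical dense table: rejection-sample slots, O(1) expected time.
    while true
        i = rand(rng, 1:nslots)
        Base.isslotfilled(s.dict, i) && return s.dict.keys[i]
    end
end
```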

@PythonNut
Contributor

I'd like to take a look at this, if you don't mind.

@stevengj
Member Author

@PythonNut, that would be great.

@PythonNut
Contributor

PythonNut commented May 16, 2016

@stevengj

You can defeat this by deleting the contents of the set and then re-inserting a much smaller number of elements.

I'm new to the Julia scene, but this strikes me as something that has performance implications outside this issue. You mention iteration, and I wouldn't be surprised if other operations would also be impeded by large numbers of empty bins.

One possible solution is to rehash! (but in the down direction, which isn't currently implemented, AFAICT) when length(s.dict.slots)/length(s) gets too large. This also saves space, which is nice. Performance obviously takes a hit if you're constantly varying the size of the hash table by large factors, but that sounds somewhat uncommon (idk. is it really?).

Does this sound like a better path forward?

@stevengj
Member Author

stevengj commented May 16, 2016

I think it would be good to fix Base.rehash! so that it can shrink the hash table if needed.

I don't know about automating this; the need for that seems pretty rare. Typical uses of Set seem to grow but almost never shrink them (except to empty and refill to about the same size).

(We should also check what e.g. Python does in this case.)

Anyway, that should be a separate issue and a separate PR.
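For reference, the effect of a shrinking rehash! can already be approximated today by rebuilding the container (shrink is a hypothetical helper; a real fix would resize the table in place):

```julia
# Rebuilding allocates a fresh hash table sized for the current number
# of elements, discarding the oversized slots array.
shrink(s::Set) = Set(collect(s))
```

For example, after the Set(1:10^7); empty!(s); push!(s, 1) sequence above, shrink(s) yields an equivalent one-element set backed by a small table.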

@PythonNut
Contributor

PythonNut commented May 19, 2016

@stevengj that sounds quite reasonable.

I've run some profiling, the results of which are visualized here:

  • O(1) is the algorithm you supplied
  • O(n) is rand(collect(s))

[benchmark plots, v0.4.5 Fedora x86_64 i5-2520M: set_rand_perf_100k, set_rand_perf_1m, set_rand_perf_10m]

[benchmark plots, v0.4.5 Fedora i386 Core2Duo T2500: set_rand_perf_100k, set_rand_perf_1m, set_rand_perf_10m]

Conclusion

As you can see, the crossover points vary with the size of the set (in addition to the number of bins), and also vary by machine. In addition, the O(n) solution only beats your O(1) solution when the proportion of unfilled bins is very large (>99.9% in all tests). Do you still think trying to nail the threshold down with more testing is the right way to go?

@ivarne ivarne added the randomness Random number generation and the Random stdlib label May 19, 2016
@stevengj
Member Author

stevengj commented May 19, 2016

@PythonNut, nice job. In principle, one could do a bit better than rand(collect(s)), since a linear scan avoids allocating a temporary array:

    i = rand(1:length(s))
    state = start(s)
    for k = 1:i-1
        _, state = next(s, state)
    end
    return next(s, state)[1]

But I doubt it will change the crossover point by orders of magnitude. With such a high crossover point, I would only bother to implement the O(1) algorithm.
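For later readers: start/next was replaced by the iterate protocol in Julia 1.0, so the same allocation-free scan might look like this (rand_scan is a hypothetical name):

```julia
using Random

# O(length(s)) draw that walks the set to a random position
# instead of materializing collect(s).
function rand_scan(rng::AbstractRNG, s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    i = rand(rng, 1:length(s))
    for (k, x) in enumerate(s)
        k == i && return x
    end
end
```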

@stevengj
Member Author

Note also that rather than tic/toq, you can benchmark an expression e with t = @elapsed e.
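For example (the timed expression here is arbitrary):

```julia
# @elapsed evaluates the expression and returns the wall-clock seconds
# it took, replacing a tic()/toq() pair around the same code.
t = @elapsed sum(rand(10^6))
```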

@PythonNut
Contributor

PythonNut commented May 19, 2016

@stevengj yes, looks like the effect isn't that significant.

[benchmark plots, v0.4.5 Arch Linux x86_64 i7-4712HQ: set_rand_perf_1m (allocation-avoiding technique) vs. protogon_set_rand_perf_1m (original technique)]

I understand that your technique has the additional advantage of short circuiting the collection once the value is found, so it performs better on average. However, they have almost identical worst-case times, which I can't explain since we should be saving a bit of allocation. ╮(╯_╰)╭

Note also that rather than tic/toq, you can benchmark an expression e with t = @elapsed e.

Thanks. I was looking for something like that, but couldn't find it for some reason.

I guess I'll transition to copying your implementation over and writing tests. It might be a while, since I'm new to the Julia codebase.

@KristofferC
Sponsor Member

Your thoroughness is inspiring.

@PythonNut
Contributor

@stevengj would you also like a rand(::Dict) method?

@stevengj
Member Author

rand(::Dict) would return a key => value pair, I guess? That might make sense, since then rand(s::Set) could just call rand(s.dict)[1] so you get two methods for the price of one.
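A hedged sketch of that pairing (rand_pair and rand_elem are hypothetical names, and the slots/keys/vals field accesses assume Dict internals that may differ across Julia versions):

```julia
using Random

# Draw a random key => value Pair from a Dict by rejection-sampling slots.
function rand_pair(rng::AbstractRNG, d::Dict)
    isempty(d) && throw(ArgumentError("dict must be non-empty"))
    n = length(d.slots)
    while true
        i = rand(rng, 1:n)
        Base.isslotfilled(d, i) && return d.keys[i] => d.vals[i]
    end
end

# The Set method then comes almost for free: a Set is backed by a Dict
# whose values are all nothing, so take the key of a random pair.
rand_elem(rng::AbstractRNG, s::Set) = rand_pair(rng, s.dict)[1]
```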

@PythonNut
Contributor

@stevengj Yup, that's what I was thinking.

Sorry, I'm having some trouble building my development Julia since I don't have much disk space. :/ Don't worry, I'll think of something.

@PythonNut
Contributor

Sorry guys, it looks like I'll need a much bigger VM to build Julia, and I haven't gotten around to making it yet; my job is keeping me busy.

Hopefully I'll get a spare moment soon to build it and start work on the tests for this.

dbeach24 added a commit to dbeach24/julia that referenced this issue Aug 20, 2016
Performant random implementation for Dict and Set types.
mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
4 participants