
rand(::Set) #16231

Closed
stevengj opened this issue May 6, 2016 · 14 comments
Labels
good first issue (Indicates a good issue for first-time contributors to Julia), randomness (Random number generation and the Random stdlib)

Comments

@stevengj
Member

stevengj commented May 6, 2016

As mentioned on the mailing list, it might be nice to have a rand(::Set) function. Since this is hard to implement efficiently without access to the Set internals, it is something that almost has to be in Base. Here is a sample O(1) implementation:

import Base: GLOBAL_RNG, isslotfilled, rand
function rand(r::AbstractRNG, s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    n = length(s.dict.slots)
    while true
        # Rejection sampling: draw slots uniformly until a filled one is hit.
        i = rand(r, 1:n)
        isslotfilled(s.dict, i) && return s.dict.keys[i]
    end
end
rand(s::Set) = rand(GLOBAL_RNG, s)

If someone wants to put together a test case and documentation, it could make a good intro PR.

Probably also add a rand(::Set, dims) method (etcetera) to supplement rand(::Array, dims), which could be accomplished just by converting rand arguments from AbstractArray to Union{AbstractArray,AbstractSet} in a few places.
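For illustration, a hedged sketch of what such a dims method could do, written as a free function rather than the Union signature change (the helper name rand_set and its collect-based scalar draw are stand-ins, not the proposed Base implementation):

```julia
using Random

# Hypothetical helper: fill an array of the given dims with independent
# random elements of the set, one scalar draw per cell.
function rand_set(rng::AbstractRNG, s::Set, dims::Dims)
    elems = collect(s)  # stand-in for the O(1) slot-sampling scalar method
    [elems[rand(rng, 1:length(elems))] for _ in CartesianIndices(dims)]
end
rand_set(s::Set, dims::Integer...) = rand_set(Random.default_rng(), s, map(Int, dims))
```

With this, rand_set(Set([1,2,3]), 2, 3) returns a 2×3 matrix of elements drawn (with replacement) from the set.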

@stevengj stevengj added the good first issue Indicates a good issue for first-time contributors to Julia label May 6, 2016
@stevengj
Member Author

stevengj commented May 6, 2016

Note that my implementation above is efficient for the usual case that the set's internal hash table is at most a few times larger than the size of the set. You can defeat this by deleting the contents of the set and then re-inserting a much smaller number of elements. For example, my code will be extremely inefficient for:

s = Set(1:10^7)
empty!(s)
push!(s, 1)
length(s.dict.slots) / length(s) # gives 1.7e7!
@time rand(s)

Here the final rand(s) call alone takes almost 1 second.

I suppose you could test whether length(s.dict.slots) > somecoefficient * length(s)^2 and use an O(length(s)) algorithm in that case, where somecoefficient is determined by benchmarking the two algorithms and finding (roughly) the crossover point. However, even iteration over the set is very inefficient in this case [it's O(length(s.dict.slots)), not O(length(s)), I think], so there may be no good algorithm, and it may not be worth worrying about.
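The hybrid described above might be sketched like this. COEFF is a placeholder crossover constant that would be fixed by benchmarking, and the s.dict.slots / Base.isslotfilled accesses assume Dict internals that can change between Julia versions:

```julia
using Random

const COEFF = 8  # placeholder; the real value would come from benchmarking

function rand_hybrid(rng::AbstractRNG, s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    nslots = length(s.dict.slots)  # internal field; version-dependent
    if nslots > COEFF * length(s)^2
        # Pathologically sparse table: fall back to an O(length(s)) scan.
        i = rand(rng, 1:length(s))
        for (k, x) in enumerate(s)
            k == i && return x
        end
    end
    # Typical dense table: rejection-sample slots, O(1) expected time.
    while true
        i = rand(rng, 1:nslots)
        Base.isslotfilled(s.dict, i) && return s.dict.keys[i]
    end
end
```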

@PythonNut
Contributor

I'd like to take a look at this, if you don't mind.

@stevengj
Member Author

@PythonNut, that would be great.

@PythonNut
Contributor

PythonNut commented May 16, 2016

@stevengj

You can defeat this by deleting the contents of the set and then re-inserting a much smaller number of elements.

I'm new to the Julia scene, but this strikes me as something that has performance implications outside this issue. You mention iteration, and I wouldn't be surprised if other operations would also be impeded by large numbers of empty bins.

One possible solution is to rehash! (but in the down direction, which isn't currently implemented, AFAICT) when length(s.dict.slots)/length(s) gets too large. This also saves space, which is nice. Performance obviously takes a hit if you're constantly varying the size of the hash table by large factors, but that sounds somewhat uncommon (idk. is it really?).

Does this sound like a better path forward?

@stevengj
Member Author

stevengj commented May 16, 2016

I think it would be good to fix Base.rehash! so that it can shrink the hash table if needed.

I don't know about automating this; the need for that seems pretty rare. Typical uses of Set seem to grow but almost never shrink them (except to empty and refill to about the same size).

(We should also check what e.g. Python does in this case.)

Anyway, that should be a separate issue and a separate PR.
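For reference, the effect of a shrinking rehash! can already be approximated today by rebuilding the container (shrink is a hypothetical helper; a real fix would resize the table in place):

```julia
# Rebuilding allocates a fresh hash table sized for the current number
# of elements, discarding the oversized slots array.
shrink(s::Set) = Set(collect(s))
```

For example, after the Set(1:10^7); empty!(s); push!(s, 1) sequence above, shrink(s) yields an equivalent one-element set backed by a small table.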

@PythonNut
Contributor

PythonNut commented May 19, 2016

@stevengj that sounds quite reasonable.

I've run some profiling, the results of which are visualized here:

  • O(1) is the algorithm you supplied
  • O(n) is rand(collect(s))

[benchmark plots, v0.4.5 Fedora x86_64 i5-2520M: set_rand_perf_100k, set_rand_perf_1m, set_rand_perf_10m]

[benchmark plots, v0.4.5 Fedora i386 Core2Duo T2500: set_rand_perf_100k, set_rand_perf_1m, set_rand_perf_10m]

Conclusion

As you can see, the crossover points vary with the size of the set (in addition to the number of bins), and also vary by machine. In addition, the O(n) solution only beats your O(1) solution when the proportion of unfilled bins is very large (>99.9% in all tests). Do you still think trying to nail the threshold down with more testing is the right way to go?

@ivarne ivarne added the randomness Random number generation and the Random stdlib label May 19, 2016
@stevengj
Member Author

stevengj commented May 19, 2016

@PythonNut, nice job. In principle, one could do a bit better than rand(collect(s)), since a linear scan avoids allocating a temporary array:

    i = rand(1:length(s))
    state = start(s)
    for k = 1:i-1
        _, state = next(s, state)
    end
    return next(s, state)[1]

But I doubt it will change the crossover point by orders of magnitude. With such a high crossover point, I would only bother to implement the O(1) algorithm.
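For later readers: start/next was replaced by the iterate protocol in Julia 1.0, so the same allocation-free scan might look like this (rand_scan is a hypothetical name):

```julia
using Random

# O(length(s)) draw that walks the set to a random position
# instead of materializing collect(s).
function rand_scan(rng::AbstractRNG, s::Set)
    isempty(s) && throw(ArgumentError("set must be non-empty"))
    i = rand(rng, 1:length(s))
    for (k, x) in enumerate(s)
        k == i && return x
    end
end
```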

@stevengj
Member Author

Note also that rather than tic/toq, you can benchmark an expression e with t = @elapsed e.
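For example (the timed expression here is arbitrary):

```julia
# @elapsed evaluates the expression and returns the wall-clock seconds
# it took, replacing a tic()/toq() pair around the same code.
t = @elapsed sum(rand(10^6))
```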

@PythonNut
Contributor

PythonNut commented May 19, 2016

@stevengj yes, looks like the effect isn't that significant.

[benchmark plots, v0.4.5 Arch Linux x86_64 i7-4712HQ: set_rand_perf_1m (allocation-avoiding technique) vs. protogon_set_rand_perf_1m (original technique)]

I understand that your technique has the additional advantage of short circuiting the collection once the value is found, so it performs better on average. However, they have almost identical worst-case times, which I can't explain since we should be saving a bit of allocation. ╮(╯_╰)╭

Note also that rather than tic/toq, you can benchmark an expression e with t = @elapsed e.

Thanks. I was looking for something like that, but couldn't find it for some reason.

I guess I'll transition to copying your implementation over and writing tests. It might be a while, since I'm new to the Julia codebase.

@KristofferC
Sponsor Member

Your thoroughness is inspiring.

@PythonNut
Contributor

@stevengj would you also like a rand(::Dict) method?

@stevengj
Member Author

rand(::Dict) would return a key => value pair, I guess? That might make sense, since then rand(s::Set) could just call rand(s.dict)[1] so you get two methods for the price of one.
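A hedged sketch of that pairing (rand_pair and rand_elem are hypothetical names, and the slots/keys/vals field accesses assume Dict internals that may differ across Julia versions):

```julia
using Random

# Draw a random key => value Pair from a Dict by rejection-sampling slots.
function rand_pair(rng::AbstractRNG, d::Dict)
    isempty(d) && throw(ArgumentError("dict must be non-empty"))
    n = length(d.slots)
    while true
        i = rand(rng, 1:n)
        Base.isslotfilled(d, i) && return d.keys[i] => d.vals[i]
    end
end

# The Set method then comes almost for free: a Set is backed by a Dict
# whose values are all nothing, so take the key of a random pair.
rand_elem(rng::AbstractRNG, s::Set) = rand_pair(rng, s.dict)[1]
```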

@PythonNut
Contributor

@stevengj Yup, that's what I was thinking.

Sorry, I'm having some trouble building my development Julia since I don't have much disk space. :/ Don't worry, I'll think of something.

@PythonNut
Contributor

Sorry guys, it looks like I'll need a much bigger VM to build Julia, and I haven't gotten around to making it yet; my job is keeping me busy.

Hopefully I'll get a spare moment soon to build it and start work on the tests for this.

dbeach24 added a commit to dbeach24/julia that referenced this issue Aug 20, 2016
Performant random implementation for Dict and Set types.
mfasi pushed a commit to mfasi/julia that referenced this issue Sep 5, 2016
4 participants