Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed-up randperm, randcycle and shuffle #14772

Merged
merged 1 commit into from
Jan 27, 2016

Conversation

rfourquet
Copy link
Member

Cf. https://groups.google.com/forum/#!topic/julia-users/XxYcgNuHtJs.
This uses a less general but more efficient version of random generation within a range.

Note that amortizing the cost of creating RangeGenerator objects offers substantial benefits, but insufficient compared to this PR, e.g: randperm(10^7) takes here 0.92s on master, 0.47s when amortizing, and 0.23 with this PR (and 0.30s with the proposed R-based solution in the julia-users thread).

Note: it may be worth revisiting range generation, this PR is more of a stop-gap solution.

@rfourquet
Copy link
Member Author

I hadn't run the full tests locally, an easy fix for the failure of travis is to change the value of the array at line 951 of test/sparsedir/sparse.jl. After that, I get a failed test at line 1102 of the same file... I'm not sure why, and don't have time to investigate now (help welcome, I don't know this code).

@johnmyleswhite
Copy link
Member

Since you've read the R translation in that thread, I fear that we can't merge this code because it's tainted by the GPL.

@ViralBShah
Copy link
Member

This implementation has nothing in common with the posted R solution. Why is it considered tainted?

@johnmyleswhite
Copy link
Member

As far as I understand the logic of clean room implementations, no one who has read that thread can write code for this function now. By linking to the thread containing GPL code, there's prima facie evidence that the author has read the GPL code.

It's possible that an implementation that is provably non-derivative isn't tainted, but it's not clear to me what a judge would think of that argument.

@ViralBShah
Copy link
Member

OK. But this is not even a clean room implementation of what was discussed in that thread. It optimizes Julia's rand functions in the existing Julia implementation of randperm.

@johnmyleswhite
Copy link
Member

It's very possible I'm just being paranoid. I'm just worried by the acknowledgment of reading GPL code for functionality that's being updated.

@StefanKarpinski
Copy link
Sponsor Member

This code is completely unrelated to the GPL code that was posted in that thread and is fine legally.

@rfourquet
Copy link
Member Author

I opened an issue #14778 for the above mentioned test failure, which AFAICS is unrelated to this PR. By commenting out the failing line, all the tests pass locally. Should I comment it out here to have green tests, or rather wait for that other issue to be fixed first (I'm helpless about it) ?

@rfourquet
Copy link
Member Author

So I just pushed a new version (with the failing test commented out), as I updated slightly the rest of the code.

@ViralBShah
Copy link
Member

I will take a look at the failing test. Let's give a couple of days to see if we can sort it out.

ViralBShah added a commit that referenced this pull request Jan 27, 2016
speed-up randperm, randcycle and shuffle
@ViralBShah ViralBShah merged commit 5ac7f75 into JuliaLang:master Jan 27, 2016
@ViralBShah
Copy link
Member

@rfourquet Do we have an open issue on range generation that should replace this stopgap solution? Would be great - just to serve as a reminder.

@rfourquet
Copy link
Member Author

I don't think there is an open issue, but I had this in mind for a while (since some random generation functions became faster). Actually, I intended to try that on top of a 1+ year old branch which accelerates random generation of integers, but which depends on a solution like #9174 (btw, something like #9174 (comment) may also be useful for #14536), as I don't want rand(Int) to allocate in some cases.

@StefanKarpinski
Copy link
Sponsor Member

@rfourquet, would you mind opening an issue saying summarizing this state just to keep track of this?

@rfourquet
Copy link
Member Author

I could do so yes, but not sure if it makes sense as there is no real issue or bug, besides a slight duplication of the code. To be honest, I'm not certain that we can improve the efficiency of random generation on ranges in a general way, such that we can use it for randperm (i.e. having the general solution as fast as the optimized version for randperm).

So I wait for your confirmation if you think it's worth having this issue, in the lines of "rand(Int) can be sped up by using an array in MersenneTwister, this depends on avoiding pointer_to_array allocations, and then check if this can be used to improve rand on a ranges".

@rfourquet rfourquet deleted the rf/randperm/rand_lt branch January 29, 2016 05:36
@ViralBShah
Copy link
Member

Avoiding pointer_to_array is definitely the first thing. Should we rebase that PR and see if we can get it merged? I think it is worth trying.

@GunnarFarneback
Copy link
Contributor

While not breaking, this change produces different shuffles and permutations than 0.4, which can be inconvenient when you want to verify that your code gives the same results on 0.5 or compare the speed. I think it would be helpful to mention this somewhere in NEWS.md.

@tkelman
Copy link
Contributor

tkelman commented Aug 3, 2016

@GunnarFarneback could you propose a wording for such a mention?

@GunnarFarneback
Copy link
Contributor

Something like "shuffle, shuffle!, randperm, and randcycle have been switched to faster algorithms. As a side-effect they produce different results compared to 0.4 even if the random seed is the same."

The reference to the random seed is a simplification of the actual interaction with the RNG but should be good enough to convey the message that results will become different.

@rfourquet rfourquet added the randomness Random number generation and the Random stdlib label May 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster randomness Random number generation and the Random stdlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants