Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve RNG #67

Closed
JeffBezanson opened this issue Jun 21, 2011 · 28 comments
Closed

improve RNG #67

JeffBezanson opened this issue Jun 21, 2011 · 28 comments
Assignees
Labels
performance Must go faster

Comments

@JeffBezanson
Copy link
Member

We already use mersenne twister, but SFMT/dSFMT is supposed to be faster:

https://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/SFMT/index.html

We also need better seeding.

The work items are

  • Switch to SFMT
  • Move RNG library to external/, so everything under src/ is our own code.
  • Be able to seed from /dev/urandom
  • Save the seed somewhere (wherever it came from) so any run can be reproduced even if you didn't seed manually
@StefanKarpinski
Copy link
Member

For distributed simulations, it seems to me that what you want is to be able to deterministically parameterize an entire run by a single seed value. So for parallel computing, we'd want to ensure that you can do that. I'm not sure what the best way to do that is, however.

@JeffBezanson
Copy link
Member Author

Especially interesting in the case of elastic parallelism!

@ViralBShah
Copy link
Member

How would we do parallel number generation? Just use different seeds to start off, and thus use different parts of the sequence of an RNG. The issue with some of these methods is that the series of numbers generated change with a change in the number of processors - not to mention elastic parallelism.

@ViralBShah
Copy link
Member

Based on the thread on the mailing list, I don't think it is a good idea to seed an RNG from /dev/urandom as the default. It should be a seeding option. Also, it would be nice to have an RNG to generate random numbers from /dev/urandom altogether.

Random numbers quality testsuite: https://www.iro.umontreal.ca/~simardr/testu01/tu01.html

@StefanKarpinski
Copy link
Member

There are two use-cases for random number generation:

  1. I need a deterministic, reproducible, seed-based series of pseudorandom numbers, e.g. for simulation.
  2. I want actual random behavior but have to settle for pseudorandom behavior with a good random seed.

In the former case, you need to explicitly seed the your random number generator so what the generator automatically seeds itself with is irrelevant. In the latter case, you want the default to be to use a good source of entropy — like /dev/urandom — rather than some crappy default like using the current time.

Do you have some better source of entropy that you have in mind for automatically seeding the random number generator when an explicit seed isn't given? Or are you objecting to automatically seeding the generator at all? Personally, if I just want to generate a bunch of random numbers in an ad hoc manner, I really don't want to be bothered to generate a seed — I just want to use it.

@ViralBShah
Copy link
Member

In no system (C, matlab, etc.) is a random number seeded with a random seed of any type by default.

-viral

On Jun 22, 2011, at 12:48 PM, StefanKarpinski wrote:

There are two use-cases for random number generation:

  1. I need a deterministic, reproducible, seed-based series of pseudorandom numbers, e.g. for simulation.
  2. I want actual random behavior but have to settle for pseudorandom behavior with a good random seed.

In the former case, you need to explicitly seed the your random number generator so what the generator automatically seeds itself with is irrelevant. In the latter case, you want the default to be to use a good source of entropy — like /dev/urandom — rather than some crappy default like using the current time.

Do you have some better source of entropy that you have in mind for automatically seeding the random number generator when an explicit seed isn't given? Or are you objecting to automatically seeding the generator at all? Personally, if I just want to generate a bunch of random numbers in an ad hoc manner, I really don't want to be bothered to generate a seed — I just want to use it.

Reply to this email directly or view it on GitHub:
#67 (comment)

@ViralBShah
Copy link
Member

The counter-argument to the "I just want to use it" here is that "I just don't want to be surprised". Seeding an RNG by default has the property that it surprises you. Even saving the random seed automatically is not good enough, because the same .j file will run differently on a different computer where the saved seed is not available.

-viral

On Jun 22, 2011, at 12:48 PM, StefanKarpinski wrote:

There are two use-cases for random number generation:

  1. I need a deterministic, reproducible, seed-based series of pseudorandom numbers, e.g. for simulation.
  2. I want actual random behavior but have to settle for pseudorandom behavior with a good random seed.

In the former case, you need to explicitly seed the your random number generator so what the generator automatically seeds itself with is irrelevant. In the latter case, you want the default to be to use a good source of entropy — like /dev/urandom — rather than some crappy default like using the current time.

Do you have some better source of entropy that you have in mind for automatically seeding the random number generator when an explicit seed isn't given? Or are you objecting to automatically seeding the generator at all? Personally, if I just want to generate a bunch of random numbers in an ad hoc manner, I really don't want to be bothered to generate a seed — I just want to use it.

Reply to this email directly or view it on GitHub:
#67 (comment)

@StefanKarpinski
Copy link
Member

In no system (C, matlab, etc.) is a random number seeded with a random seed of any type by default.

Maybe not in C or Matlab (I checked this), but consider:

$ perl -le 'print rand'
0.481756120528193
$ perl -le 'print rand'
0.305858596301736
$ ruby -le 'print rand'
0.533189391261214
$ ruby -le 'print rand'
0.222633436955373
$ python -c 'import random; print random.random()'
0.177964855511
bash-3.2$ python -c 'import random; print random.random()'
0.191501075101
bash-3.2$ 

All three of the most user-friendly mainstream languages around automatically seed your RNG for you. So the question here is do we emulate C and Matlab or Python, Perl and Ruby. The choice seems pretty clear to me. Honestly, the only viable well-behaved options are:

  • Automatically provide a good, random seed (i.e. use /dev/urandom)
  • Throw an error if random numbers are used without having seeded first

The latter seems unnecessarily pedantic and annoying. How does seeding an RNG by default cause surprise?

@ViralBShah
Copy link
Member

Ok, I didn't know about perl, ruby and python. Neither of these are serious numerical languages, but at least there is precedent.

Seeding an RNG by default differently on every run is surprising because it generates slightly different solutions every time. So, to get predictable answers, I now have to set a seed. When developing, one wants to know if the variations are due to changes made in the code, and possibly a bug, or due to a different stream of random numbers.

-viral

On Jun 23, 2011, at 12:27 AM, StefanKarpinski wrote:

In no system (C, matlab, etc.) is a random number seeded with a random seed of any type by default.

Maybe not in C or Matlab (I checked this), but consider:

$ perl -le 'print rand'
0.481756120528193
$ perl -le 'print rand'
0.305858596301736
$ ruby -le 'print rand'
0.533189391261214
$ ruby -le 'print rand'
0.222633436955373
$ python -c 'import random; print random.random()'
0.177964855511
bash-3.2$ python -c 'import random; print random.random()'
0.191501075101
bash-3.2$ 

So the question here is do we emulate C and Matlab or Python, Perl and Ruby. The choice seems pretty clear to me. Honestly, the only viable well-behaved options are:

  • Automatically provide a good, random seed (i.e. use /dev/urandom)
  • Throw an error if random numbers are used without having seeded first

The latter seems unnecessarily pedantic and annoying. How does seeding an RNG by default cause surprise?

Reply to this email directly or view it on GitHub:
#67 (comment)

@JeffBezanson
Copy link
Member Author

You're right that predictability is needed for debugging, but that can never be achieved automatically --- to reproduce a run you still have to either re-seed manually or restart the system. If you're modifying your code and re-running it without re-seeding, you're going to get different answers every time in every system, unless you restart matlab between runs, which people don't do.

If you start N julia processes to do a simulation, and they all start with the same seed, you'll be doing the exact same thing on each processor without realizing it. It might take a while to notice this. It seems this would hurt the naive user. The expert user will manage their own seeding no matter what we do.

@ViralBShah
Copy link
Member

Are you referring to N julia processes started on the command line, or through julia. If they are part of a "Julia parallel machine" or some such thing, then julia will do the right thing. But, yes, if you are just running a bunch of disconnected julias, then one would have to seed.

I am getting a little convinced.

-viral

On Jun 23, 2011, at 7:34 AM, JeffBezanson wrote:

You're right that predictability is needed for debugging, but that can never be achieved automatically --- to reproduce a run you still have to either re-seed manually or restart the system. If you're modifying your code and re-running it without re-seeding, you're going to get different answers every time in every system, unless you restart matlab between runs, which people don't do.

If you start N julia processes to do a simulation, and they all start with the same seed, you'll be doing the exact same thing on each processor without realizing it. It might take a while to notice this. It seems this would hurt the naive user. The expert user will manage their own seeding no matter what we do.

Reply to this email directly or view it on GitHub:
#67 (comment)

@ViralBShah
Copy link
Member

Any idea what R does?

-viral

On Jun 23, 2011, at 12:27 AM, StefanKarpinski wrote:

In no system (C, matlab, etc.) is a random number seeded with a random seed of any type by default.

Maybe not in C or Matlab (I checked this), but consider:

$ perl -le 'print rand'
0.481756120528193
$ perl -le 'print rand'
0.305858596301736
$ ruby -le 'print rand'
0.533189391261214
$ ruby -le 'print rand'
0.222633436955373
$ python -c 'import random; print random.random()'
0.177964855511
bash-3.2$ python -c 'import random; print random.random()'
0.191501075101
bash-3.2$ 

So the question here is do we emulate C and Matlab or Python, Perl and Ruby. The choice seems pretty clear to me. Honestly, the only viable well-behaved options are:

  • Automatically provide a good, random seed (i.e. use /dev/urandom)
  • Throw an error if random numbers are used without having seeded first

The latter seems unnecessarily pedantic and annoying. How does seeding an RNG by default cause surprise?

Reply to this email directly or view it on GitHub:
#67 (comment)

@JeffBezanson
Copy link
Member Author

Copying R here is a good idea. Stats people swear by it. Then again, R doesn't have the parallelism issue.

@ViralBShah
Copy link
Member

Yes, that's what I thought too - but just a useful data point.

-viral

On Jun 23, 2011, at 7:44 AM, JeffBezanson wrote:

Copying R here is a good idea. Stats people swear by it. Then again, R doesn't have the parallelism issue.

Reply to this email directly or view it on GitHub:
#67 (comment)

@ViralBShah
Copy link
Member

How portable is /dev/urandom. I believe all commonly used unix-y systems should have it today. Not sure what Android/iOS and embedded systems are like, but we have much more to worry about to target those platforms..

-viral

On Jun 23, 2011, at 7:44 AM, JeffBezanson wrote:

Copying R here is a good idea. Stats people swear by it. Then again, R doesn't have the parallelism issue.

Reply to this email directly or view it on GitHub:
#67 (comment)

@ViralBShah
Copy link
Member

Well, this page tells us all. Apparently there is also the Entropy Gathering Daemon for computers without /dev/urandom. EGD Sounds like something that might end the universe or something.

https://en.wikipedia.org/wiki//dev/random

I really think we ought to build the Testu01 testsuite into julia.

-viral

On Jun 23, 2011, at 7:44 AM, JeffBezanson wrote:

Copying R here is a good idea. Stats people swear by it. Then again, R doesn't have the parallelism issue.

Reply to this email directly or view it on GitHub:
#67 (comment)

@StefanKarpinski
Copy link
Member

Bam:

$ echo 'runif(1)' | R
> runif(1)
[1] 0.9168643
> 
$ echo 'runif(1)' | R
> runif(1)
[1] 0.3805377
> 

So R does the same thing as Python, Perl, and Ruby.

@StefanKarpinski
Copy link
Member

/dev/urandom looks pretty damned universal. But that's not really the point: the point is to automatically choose use the best practices instead of making the naive user do something non-default in order to adhere to best practices.

I'm not sure how useful testu01 really would be: it's a test of algorithms, not of implementations as such. We could, however, run our implementation through it since if there's a defect in the implementation, it would probably detect that defect as a deviation from apparent randomness. If it can't detect the defect, one could argue that it's not really a defect since the resulting RNG still appears random.

@StefanKarpinski
Copy link
Member

One interesting thing I thought about (based on that best practices in bioinformatics paper I sent out) was combining one of the cheap, simple non-linear RNGs like the JKISS one described with MT to get something that has the same period as MT, is slightly more expensive, but is non-linear. But I'm a complete amateur in this area, so this might be an incredibly dumb idea.

@StefanKarpinski
Copy link
Member

Apparently there is also the Entropy Gathering Daemon for computers without /dev/urandom. EGD Sounds like something that might end the universe or something.

Entropy Gathering Daemon does sound incredibly ominous. I feel like that should be the evil monster in a kids' book.

@ViralBShah
Copy link
Member

The device is universal, but the implementations differ a fair bit across OSes as per the Wikipedia page. Also, people have shown quality of /dev/random to be poor on some platforms. We could go with it anyways, knowing that it is still likely to be better than other options.

-viral

On Jun 23, 2011, at 8:58 AM, StefanKarpinski wrote:

/dev/urandom looks pretty damned universal. But that's not really the point: the point is to automatically choose use the best practices instead of making the naive user do something non-default in order to adhere to best practices.

I'm not sure how useful testu01 really would be: it's a test of algorithms, not of implementations as such. We could, however, run our implementation through it since if there's a defect in the implementation, it would probably detect that defect as a deviation from apparent randomness. If it can't detect the defect, one could argue that it's not really a defect since the resulting RNG still appears random.

Reply to this email directly or view it on GitHub:
#67 (comment)

@StefanKarpinski
Copy link
Member

We can implement the best practice on each platform. AFAIK, /dev/urandom is fine on both Linux and OS X. On Windows, we'll use that system random bits API and we can use EGD elsewhere or whatever.

ViralBShah added a commit that referenced this issue Jun 29, 2011
@ViralBShah
Copy link
Member

The RNG is also used by flisp. Is it ok to remove all the random stuff from flisp? I can then move it all out of support and into external. My proposal is to have external/random, which includes mt19937ar.c, mt19937-64.c, dsfmt-2.1, and our wrappers that are in random.c. All of these can then be compiled into librandom.so. rand_double() and rand_float() then will get called through ccall.

We can then experiment with various ways to generate random numbers.

If rand_double() and rand_float() are called through ccall, rather than as builtins, will this greatly affect performance? It seems that this library approach is much cleaner.

@JeffBezanson
Copy link
Member Author

rand_double and rand_float are already called through ccall. This is the best approach, because the compiler can inline the ccall into callers of rand() and then there's no overhead at all.

I like your approach, let's do it. It's fine to remove the random stuff from flisp, I'll do that part.

@ghost ghost assigned ViralBShah Jul 2, 2011
@ViralBShah
Copy link
Member

commit 24fc7ed enables the use of DSFMT. Closing this one. A new issue can be opened later when we want to use /dev/urandom and save the seed.

@JeffBezanson
Copy link
Member Author

Now we don't have the ability to seed with a source of entropy anymore. Maybe using the 64-bit system time isn't great, but at least it's something. At the very least we need to expose dSFMT's functions for seeding with more bits than 32.
Also I see that mt19937ar.c is still there; is it used for anything? And what is mt19937-64?

@ViralBShah
Copy link
Member

All still in transition phase. Will get rid of mt19937ar.c and mt19937-64 (64-bit version of mt19937) once everything works. Just need to expose a couple more dSFMT interfaces. Will try to implement /dev/urandom as well.

-viral

On Jul 9, 2011, at 2:54 AM, JeffBezanson wrote:

Now we don't have the ability to seed with a source of entropy anymore. Maybe using the 64-bit system time isn't great, but at least it's something. At the very least we need to expose dSFMT's functions for seeding with more bits than 32.
Also I see that mt19937ar.c is still there; is it used for anything? And what is mt19937-64?

Reply to this email directly or view it on GitHub:
#67 (comment)

@ViralBShah
Copy link
Member

All of this is done in commit c4aa6c7

-viral

On Jul 9, 2011, at 2:54 AM, JeffBezanson wrote:

Now we don't have the ability to seed with a source of entropy anymore. Maybe using the 64-bit system time isn't great, but at least it's something. At the very least we need to expose dSFMT's functions for seeding with more bits than 32.
Also I see that mt19937ar.c is still there; is it used for anything? And what is mt19937-64?

Reply to this email directly or view it on GitHub:
#67 (comment)

StefanKarpinski pushed a commit that referenced this issue Feb 8, 2018
Add a linear eachindex definition
cmcaine pushed a commit to cmcaine/julia that referenced this issue Sep 24, 2020
Fix runtests: testset results print by default
Keno pushed a commit that referenced this issue Oct 9, 2023
Build compiled methods to handle llvmcall
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

3 participants