
Random: allow negative seeds #51416

Merged
merged 2 commits into from
Sep 29, 2023

Conversation

rfourquet
Member

@rfourquet rfourquet commented Sep 21, 2023

Alternative to #46190, see that PR for background. There isn't a strong use-case for accepting negative seeds, but probably many people tried something like seed = rand(Int); seed!(rng, seed) and saw it failing. As it's easy to support, let's do it.

This might "break" some random streams (those for which the upper bit of make_seed(seed)[end] was set), but such cases are rare.

@rfourquet rfourquet added the domain:randomness Random number generation and the Random stdlib label Sep 21, 2023
seed = UInt32[]
while true
# we directly encode the bit pattern of `abs(n)` into the resulting vector `seed`;
# to greatly limit breaking the streams of random numbers, we encode the sign bit
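The encoding idea in this excerpt can be sketched as follows. This is a hypothetical simplification, not the PR's actual implementation; `sketch_make_seed` is an invented name, and it ignores the `abs(typemin(T))` overflow case that the real code handles via `unsigned`:

```julia
# Hypothetical sketch (not the actual PR code): encode abs(n) as 32-bit
# limbs, keep the top bit of the last limb free, then fold the sign of n
# into that bit. Non-negative inputs keep their previous encoding unless
# their last limb already had the top bit set.
function sketch_make_seed(n::Integer)
    neg = signbit(n)
    n = abs(n)  # NOTE: overflows for typemin(Int); the real code avoids this
    seed = UInt32[]
    while true
        push!(seed, n % UInt32)  # low 32 bits of n
        n >>>= 32
        if n == 0
            # reserve the top bit of the last limb for the sign
            seed[end] >= 0x80000000 && push!(seed, zero(UInt32))
            break
        end
    end
    neg && (seed[end] |= 0x80000000)
    return seed
end

sketch_make_seed(1)           # [0x00000001]
sketch_make_seed(-1)          # [0x80000001]
sketch_make_seed(0x80000001)  # [0x80000001, 0x00000000]
```

Note that equal values encode identically regardless of input type, e.g. `sketch_make_seed(big(-1)) == sketch_make_seed(Int8(-1))`.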
Contributor

@Seelengrab Seelengrab Sep 21, 2023

The stream of random numbers is documented not to be stable across versions, even for the same seed. This seems like a lot of additional code to avoid breaking something we don't guarantee in the first place.

Member Author

We can definitely break the random streams, but it can then be some work to fix the seeds in LinearAlgebra! And probably for some people in the wild. But I don't think that preserving most seeds required much more code; the alternatives I saw were:

  1. storing the sign bit as the first bit of seed[1], which would require shifting
  2. storing the sign bit in a dedicated UInt32, e.g. in seed[1] or seed[end] (probably the simplest option)
  3. storing a hard coded string of random UInt32 numbers at the beginning of seed! to encode negativity (e.g. UInt32[0xbc7f82b8, 0x53f967d9, 0x538ab2e4, 0x7c8725a9]); this would also have been "non-stream-breaking"

Contributor

As discussed in the other PR, I still don't understand why the sign bit needs to be treated specially at all. It's a bit. make_seed certainly doesn't make use of the fact that what's passed in is a mathematical number; it only cares for the bits. Why is this one bit special to the purpose of make_seed?

Member Author

I already explained my reasoning at length in the other PR; I'm afraid I don't have many more ideas on how to explain this.

A Julia integer can be seen as a bag of bits, but it's first and foremost an object representing a mathematical number. I simply want (1) n == m implies make_seed(n) == make_seed(m), and (2) n != m implies make_seed(n) != make_seed(m), without caring about the bits which encode n or m. I just need to care about the bits (in particular the sign bit) in the implementation, which must simply do what it takes to honor these properties.

Property (1) is what we expect from mathematical functions, and property (2) is just useful for seed!: after all we have the following in the docstring for MersenneTwister:

Different RNG objects can have their own seeds, which may be useful for generating different streams
of random numbers

Yes, with different seeds, we expect different streams.

Contributor

My objection is with the interpretation that make_seed takes a mathematical integer and maps that to a stream of numbers. It doesn't - it takes the given bag-of-bits and initializes the RNG state with that. A Julia Int(8/16/32/64/128) (which are really from a modular ring, not the entire integers) is just the most convenient & packed representation of that we have.

This is necessarily an operation that can't produce unique RNG objects for every possible seed, by the pigeonhole principle: the state of the PRNG is of finite size. You can of course say that you only really care about finitely sized inputs (so Base.BitIntegers, basically), but then you also have to prove that those different inputs not only produce different RNG initial streams, but that the stream of numbers (and the associated states) doesn't overlap (otherwise you're on a shared cycle). This is already not the case for any RNG based on LFSR, which merely has long cycle lengths.

So, with that in mind, I don't understand why those two properties in particular are desirable, other than to give the illusion of uncorrelated RNG streams. I get that the properties seem nice, but you don't really get much out of them algorithmically. Documenting that make_seed by default cares only about the bits of whatever object it's given seems just as simple to document & easier to maintain.

Member Author

Documenting that make_seed by default cares only about the bits of whatever object it's given seems just as simple to document & easier to maintain.

I agree it might be easier to maintain, but not by much, as this PR shows. But also, again, I see no particular reason to treat integers as a bag of bits: it's only an implementation detail of make_seed. Mostly everywhere else, we treat julia integers as mathematical integers as much as reasonably possible. hash for example does that.

Besides, by just applying unsigned to the input like in #46190, we still have to handle BigInt separately, and this wouldn't even address the case of other non-bits integers in the wild. Here make_seed will work for any integer implementation providing a view of their instances as two's complement (which BigInt does).

There are two steps in the process:

  1. map integers to vector seeds suitable to initialize RNGs
  2. map vectors from 1) into random streams

If the two properties (P1) mathematical function and (P2) injectivity, i.e. x == y iff f(x) == f(y), hold for each step, then it holds for the whole process.
This PR addresses only 1), and implements P1 and P2.

For 2), (P1) is already generally implemented: for a given initialization vector, the RNGs implemented by Random are initialized into a deterministic state.
For (P2), of course, as already mentioned in the other PR, true injectivity can't hold, but we get it in practice by using SHA2_256.

but then you also have to prove that those different inputs not only produce different RNG initial streams, but that the stream of numbers (and the associated states) doesn't overlap

This is a good point that users should be aware of; though I don't have to prove anything, as I'm not trying to solve this here, only trying to do the best possible thing for seeding. To be certain to avoid overlapping streams, the only way I know of (for the RNGs we have) would be to use "jumping", which is already implemented for MersenneTwister, and there is an open PR for Xoshiro. For users not using jumps, I think the best we can do is to make seed!(::Integer) produce different streams for different inputs (and the same streams for == inputs). In practice, two distinct streams will be extremely unlikely to overlap during program execution. For example, if I'm not mistaken, this paper (I could only read the intro, but it gives the formula) suggests that for n processors each producing sequences of length L for a state with period P, "the probability of overlap is always at most n^2 * L / P". Setting n = big(10.0)^4; L = big(2.0)^50; P = big(2.0)^256 (Xoshiro has period 2^256 - 1), this formula gives a probability of getting overlapping sequences upper bounded by 2^(-179). In other words, were we to trust this paper, with these parameters it's more likely to find a collision in SHA2_256 than to get overlapping sequences with Xoshiro.
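The quoted bound can be checked directly. This only reproduces the arithmetic; the formula itself comes from the paper cited above:

```julia
# Overlap-probability bound: at most n^2 * L / P for n streams of
# length L from an RNG with period P (formula from the cited paper;
# the numbers are the ones used in the comment above).
n = big(10.0)^4    # number of streams
L = big(2.0)^50    # length of each generated sequence
P = big(2.0)^256   # Xoshiro's period is 2^256 - 1; 2^256 is close enough
bound = n^2 * L / P
println(log2(bound))  # ≈ -179.4, so the bound is below 2^-179
```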

I don't understand why those two properties in particular are desirable, other than to give the illusion of uncorrelated RNG streams

It's maybe an illusion in theory, but given the arguments above, I believe that for all practical purposes, these properties go a long way toward ensuring uncorrelated streams.

Contributor

Mostly everywhere else, we treat julia integers as mathematical integers as much as reasonably possible.

No; Julia integers are not mathematical integers (i.e., the set of integers ℤ, if that's what you mean). They are modular integers. make_seed ignores that in any case, and is only interested in the "bag of bits" behavior of them - otherwise operations like & 0xffffffff wouldn't make sense in the first place, and would have to be written like % 0x100000000 instead.

Besides, by just applying unsigned to the input like in #46190, we still have to handle BigInt separately, and this wouldn't even address the case of other non-bits integers in the wild. Here make_seed will work for any integer implementation providing a view of their instances as two's complement (which BigInt does).

Naive conversion via unsigned still propagates the sign bit:

julia> reinterpret(Int8, 0xff) |> unsigned
0xff # _not_ 0xef!

which is exactly why I chose it in the original PR, in order to not have to take special considerations. So by special casing signbit, you've introduced a bias where the sign bit is counted twice if it's set (once during the conversion to unsigned, then again in your if neg), meaning it contributes more to the seeded value than the other bits. That's why it's dangerous to do this kind of special casing, and why it should be avoided at all costs.

There are two steps in the process:

  1. map integers to vector seeds suitable to initialize RNGs

"suitable" is doing a lot of heavy lifting here. Why is the direct interpretation of "bag of bits" not suitable - other than losing numerical interpretation, which make_seed has no history of making in the first place?

  2. map vectors from 1) into random streams

If the two properties (P1) mathematical function and (P2) injectivity, i.e. x == y iff f(x) == f(y), hold for each step, then it holds for the whole process.

Could you expand on what you mean with "(P1) mathematical function"? That is not a testable property of a piece of code, as far as I'm aware. What criterion makes a function "mathematical"?

For (P2), of course, as already mentioned in the other PR, true injectivity can't hold, but we get it in practice by using SHA2_256.

SHA cannot make a non-injective input injective overall. The failure of injectivity already happened before SHA ever saw any input; it cannot "disentangle" that without additional information, which is not provided as far as I can tell.

For users not using jumps, I think the best we can do is to make seed!(::Integer) produce different streams for different inputs (and the same streams for == inputs).

Yes, I agree with this. But I don't agree with "different inputs" meaning "different numerical interpretations of a string of bits" for the purpose of seeding an RNG. make_seed and RNGs in general do not care about numerical interpretation of the input seed; they only care about the raw bits, without interpretation.

To be certain to avoid overlapping streams, the only way I know of (for the RNGs we have) would be to use "jumping", which is already implemented for MersenneTwister, and there is an open PR for Xoshiro. For users not using jumps, I think the best we can do is to make seed!(::Integer) produce different streams for different inputs (and the same streams for == inputs).

I don't follow here. This is not about jumping or multiprocessing at all - this is about what make_seed interprets its input as. I'm arguing that it should only care about the bits, not the type/domain. You're arguing for the opposite. Any existing user will be wholly unaffected by this (or the other PR), because existing working inputs don't suddenly overlap with each other. It's only non-working inputs that may overlap with existing working inputs (which is, again, inevitable due to the nature of finite space).

Further, any user relying on task-based multithreading in julia will (or does) already benefit from jumping, both in MT as well as Xoshiro; the tasks are already seeded with a jumped value from the parent RNG.

It's maybe an illusion in theory, but given the arguments above, I believe that for all practical purposes, these properties go a long way toward ensuring uncorrelated streams.

No, it's also an illusion in practice, and one we should document & make visible, so that people can avoid shooting themselves in the foot by accident, by encouraging use of jumping and heavily discouraging seeding multiple RNGs. The correct procedure for getting multiple statistically uncorrelated RNG streams is to seed one parent RNG and create multiple sub-RNGs by jumping. Seeding multiple RNGs and hoping for uncorrelation is not sound.

Member Author

make_seed ignores that in any case, and is only interested in the "bag of bits" behavior of them - otherwise operations like & 0xffffffff wouldn't make sense in the first place, and would have to be written like % 0x100000000 instead.

I don't understand what makes you say that make_seed is only interested in the bag of bits. The current implementation can lead to both interpretations; the fact that we use bitwise operations is just an implementation detail: they are the tools we have to dissect and assemble integers. I again concede that for that we need to make the assumption that these integer types expose a two's complement "interface", i.e. bitwise operations.

Naive conversion via unsigned still propagates the sign bit:

I don't understand this point.

you've introduced a bias where the sign bit is counted twice if it's set (once during the conversion to unsigned,

I believe not: one way to see that is that, unless of course the implementation in this PR is faulty, it implements the following: x == y iff f(x) == f(y). Now consider the path taken for x::BigInt: abs(x) can't be < 0, and no unsigned call is involved. Therefore, for BigInt, the sign bit is not counted twice: it's removed from x when taking abs, and is "transferred" into neg, which is then used to set one bit in the resulting array. Now for any non-BigInt y equal to x, we have make_seed(x) == make_seed(y), so the sign bit of y is not counted twice.

then again in your if neg), meaning it contributes more to the seeded value than the other bits. That's why it's dangerous to do this kind of special casing, and why it should be avoided at all costs.

But even then, even if the sign bit were counted twice, for whatever that means, I fail to see what problem this would lead to.

Why is the direct interpretation of "bag of bits" not suitable - other than losing numerical interpretation,

I think it's suitable to a great extent, but I like the interpretation as "math integers" better here, as for example seed(-1) will lead to the same stream on 32 vs 64 bits platforms.

which make_seed has no history of making in the first place?

As an anecdote, make_seed initially accepted negative numbers (67f6fee), but was changed shortly after (3c5946c). There is not a lot of information in these commits, but I would guess the change was precisely to get consistent streams across 32- and 64-bit platforms (though I could be totally wrong there).

Could you expand on what you mean with "(P1) mathematical function"? That is not a testable property of a piece of code, as far as I'm aware. What criterion makes a function "mathematical"?

Sorry, I was too terse I guess. The property x == y iff f(x) == f(y) comprises:
(P1) "mathematical function", i.e. x == y implies f(x) == f(y)
(P2) injectivity, i.e. f(x) == f(y) implies x == y

So rand(x::Integer) is definitely not a "mathematical function", but hash(x) is. Likewise, I want make_seed to be mathematical.

SHA cannot make a non-injective input injective overall. The failure of injectivity already happened before SHA ever saw any input; it cannot "disentangle" that without additional information, which is not provided as far as I can tell.

I don't get this. The pipeline being x |> make_seed |> SHA: if SHA is injective and make_seed is injective, then SHA ∘ make_seed is injective, i.e. essentially seed!.

I'm arguing that it should only care about the bits, not the type/domain

Yes, and I don't see any advantage to that, except slightly simpler code if we didn't bother to accept BigInt; but BigInt must be supported.

It's only non-working inputs that may overlap with existing working inputs (which is, again, inevitable due to the nature of finite space).

But the big difference is that with the other PR using unsigned, it's trivial to construct two distinct (!=) seeds which produce the same streams, whereas the solution proposed here makes this practically impossible.

No, it's also an illusion in practice, and one we should document & make visible, so that people can avoid shooting themselves in the foot by accident, by encouraging use of jumping and heavily discouraging seeding multiple RNGs

I'm all for making jumping more visible, and I find it useful; I even implemented the jumping code for Julia's MersenneTwister. But at the same time, users will sometimes just not care about it, because seeding with a bunch of integers just works in practice, e.g. Xoshiro.((1, 2, 3, 4)). I'm not an expert on the subject, but I tried to provide some justification for why the 4 resulting Xoshiro instances have such an astronomically small chance of running into overlapping sequences that it's impossible in practice. In other words, barring bugs, I believe no one will ever run into overlapping sequences by accident when using distinct integer seeds. On the other hand, it's very easy to run into the same sequence from two distinct seeds if we interpret them as bags of bits.

Contributor

Now for any non-BigInt y equal to x, we have make_seed(x) == make_seed(y), so the sign bit of y is not counted twice.

It is counted twice:

julia> make_seed(0x80000001)
2-element Vector{UInt32}:
 0x80000001
 0x00000000

Once in propagation of the original data (hence the leading 8), and once when pushing the additional zero! You're "extracting" two bits of information from a single input bit, whereas the other bits only contribute a single bit themselves. That is a bias in favor of that single bit, which ends up contributing 33 bits of "information" to the input of SHA256 (the extra word is always going to be zero, or not exist at all; meaning, it might as well not be there!). There is no additional entropy being added; all you're doing is mapping to a different set of SHA256 outputs due to the Unsignedness (on the type level) of the input.

So rand(x::Integer) is definitely not a "mathematical function", but hash(x) is. Likewise, I want make_seed to be mathematical.

I don't get this. The pipeline being x |> make_seed |> SHA, if SHA is injective and if make_seed is injective, then SHA ∘ make_seed, is injective, i.e. seed! essentially.

Those are both "mathematical functions". Both map an input to an output. What you're describing with P1 and P2 sounds to me like bijectivity, a property some mathematical functions (i.e., functions) have. Since we've already established in the other PR that this function cannot be injective (and thus not bijective) precisely because of BigInt, it seems odd to me to want to cling to this.

I think it's suitable to a great extent, but I like the interpretation as "math integers" better here, as for example seed(-1) will lead to the same stream on 32 vs 64 bits platforms.

Why is that important? We intentionally barely guarantee anything about randomness. Why guarantee this? Why should a user expect this to be the case, when in tons of other contexts (e.g. hashing) we don't provide stableness across architectures at all?

But the big difference is that with the other PR using unsigned, it's trivial to construct two distinct (!=) seeds which produce the same streams, whereas the solution proposed here makes this practically impossible.

No, it's just as trivial. Just create two random integers, convert them to BigInt, multiply them and use them as the seed. Because we're going through SHA256 later on, the larger input will be compressed into a fixed length and by the pigeonhole principle, we're pretty much guaranteed to have a collision for sufficiently large BigInt. Supporting BigInt just means throwing any sort of injectivity right out of the window, if we intend to then go through SHA.

In other words, barring bugs, I believe no one will ever run into overlapping sequences by accident when using distinct integer seeds. On the other hand, it's very easy to run into the same sequence from two distinct seeds if we interpret them as bags of bits.

So you mean to tell me that it's more likely that a user runs into an overlap between 0x80 vs. -1, no matter the interpretation? I think this is a false equivalence; if the input bits have the exact same distribution, both approaches lead to equally likely collisions (again, by pigeonhole). There simply aren't magically more bits; signedness is not special.

Not to mention that due to SHA, "by accident" is also not going to work for the approach in the other PR. Are there bitpatterns for signed integers that overlap with the results from unsigned integers? Yes, but this PR does nothing to prevent that, save for pushing an additional zero(UInt32) when the input is ::Unsigned (which would be easy to add to the other PR as well, which would still also result in biasing).

Member Author

Once in propagation of the original data (hence the leading 8), and once when pushing the additional zero! You're "extracting" two bits of information from a single input bit, whereas the other bits only contribute a single bit themselves. That is a bias in favor of that single bit, which ends up contributing 33 bits of "information" to the input of SHA256

Oh, I think I now see what you mean. But no, when interpreting Julia integers as mathematical integers, I'm not counting the sign bit twice: a UInt32 needs 33 bits to be encoded as a math integer: 32 bits for the absolute value given by the UInt32, and one bit for the sign. IOW, in UInt32, the sign bit is implicit and doesn't require an explicit storage bit, because the information is in the type itself. make_seed is nothing more than an encoder for arbitrary-length integers; in particular, the set of possible outputs is isomorphic to the set of BigInts. This transformation is useful only insofar as RNGs don't digest BigInt well; they like to eat arrays of unsigned. When you pass a value of type UInt or UInt32 or BigInt etc. into make_seed, the input type is erased; only the mathematical value is taken into account. In particular, the sign of the value 0x80000001 must be encoded in this array representation if I want to implement properties P1 and P2.

But anyway, I don't actually make much sense of "you're extracting two bits of information from a single input bit": from a single bit I can extract at most a single bit of information; the way I encode it doesn't change that. I could encode the sign bit using 10 bits of storage, and these 10 bits would still store one bit of information. And this redundant encoding wouldn't affect at all the soundness of the integer |> make_seed |> SHA pipeline; SHA doesn't care, it just gives a distinct, random-looking output for distinct inputs, however they are encoded.

Those are both "mathematical functions". Both map an input to an output

By "both", are you refering to the examples I gave, i.e. rand(x) and hash(x) ? rand(x) is not mapping an input to an output:

julia> rand(1)
1-element Vector{Float64}:
 0.8993653769107961

julia> rand(1)
1-element Vector{Float64}:
 0.7678647843242529

For a given input 1, there isn't any fixed output.

What you're describing with P1 and P2 sounds to me like bijectivity

No, P1 is x == y implies f(x) == f(y) which is a property of math functions (and which rand doesn't satisfy), and P2 is injectivity f(x) == f(y) implies x == y. Bijectivity is injectivity + surjectivity. Here I don't specifically bother about surjection, that's why I don't try to have every possible Vector{UInt32} be an encoding of an integer.

Since we've already established in the other PR that this function cannot be injective (and thus not bijective) precisely because of BigInt, it seems odd to me to want to cling to this.

I will repeat for the last time: while make_seed is truly injective, SHA, of course, in theory, is ridiculously NOT injective (an infinite number of inputs can map to the same output), but for all practical purposes it can be considered injective.

Why is that important? We intentionally barely guarantee anything about randomness. Why guarantee this?

It's not crucial, but it's nice to have and trivial to implement. We have seed! for a reason: to reproduce a computation. If I want to reproduce a computation on another machine with a different architecture, it's just nice to have the stream of random numbers be reproducible.

Why should a user expect this to be the case, when in tons of other contexts (e.g. hashing) we don't provide stableness across architectures at all?

AFAIK, hash implements P1, e.g. hash(Int64(-1)) == hash(Int32(-1)) == hash(big(-1)). hash goes to great lengths to ensure that the inputs are considered as math integers rather than only bags of bits in order to implement P1.
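This property is easy to check at the REPL (a quick illustration, nothing more; it follows from Julia's documented contract that hash is consistent with isequal):

```julia
# hash satisfies P1: == values hash equally regardless of integer type.
@assert hash(Int64(-1)) == hash(Int32(-1)) == hash(big(-1))
@assert hash(UInt8(42)) == hash(Int128(42)) == hash(big(42))
```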

No, it's just as trivial.

It can't be trivial if no one on earth is able to provide an instance of two distinct seeds producing the same stream. It's trivial only to an oracle.

So you mean to tell me that it's more likely that a user runs into an overlap between 0x80 vs. -1, no matter the interpretation?

I'm talking in terms of ==. I'm saying: with the other PR (the "bag of bits" interpretation), the distinct seeds Int8(-1) and 0xff will lead to the same streams, and there are many such simple examples. With this PR (the "math" interpretation), it's practically impossible that anyone will ever find two distinct seeds leading to the same streams. But of course, if we had an equality operator for "bags of bits", say =_=, then the same property would hold for the other PR: if x =_= y is false, then the streams will be distinct. But AFAIK we don't have such an operator in Julia; integers are compared with ==, which uses the mathematical interpretation.
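For concreteness, this is the kind of pair being discussed (illustrative only):

```julia
# Int8(-1) and 0xff: same bit pattern, different mathematical values.
@assert reinterpret(UInt8, Int8(-1)) == 0xff   # identical bits
@assert Int8(-1) != 0xff                       # but != as numbers
# A pure "bag of bits" make_seed maps both to the same seed vector,
# while the "math integer" interpretation keeps them distinct.
```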

Yes, but this PR does nothing to prevent that, save for pushing an additional zero(UInt32) when the input is ::Unsigned (which would be easy to add to the other PR as well, which would still also result in biasing).

I don't understand the "save for": encoding the sign bit in the upper bit of seed[end] is exactly designed to prevent that; there are many ways to encode this bit, you just have to choose one. And I don't understand your concern with biasing; there is no biasing happening here. In this PR, every integer is mapped to a determined bit field of size 256 (the output of SHA), which is random-looking and cannot be computationally mapped back to the original integer seed. For whatever input seeds you choose, the distribution of the output 256-bit fields is uniform; there is no bias. And for all practical purposes, no two integers will ever map to the same 256 bits, which is nice to have.

# we assume that `unsigned` can be called on integers `n` for which `abs(n)` is
# negative; `unsigned` is necessary for `n & 0xffffffff` below, which would
# otherwise propagate the sign bit of `n` for types smaller than UInt32
n = unsigned(n)
Contributor

Does this introduce a type instability?

Member Author

Yes, but this should be handled fine by union-splitting.

@nhz2
Contributor

nhz2 commented Sep 21, 2023

There is a way to implement this that will have no effect on existing uses of seed!. The current version of make_seed can never start with two UInt32(0) in a row. This could be used as a way to mark that an integer is negative.

@Seelengrab
Contributor

Only if the signbit continues to be masked away, in order not to introduce bias into seed creation.

@nhz2
Contributor

nhz2 commented Sep 21, 2023

Yes, I've been thinking about how to do that in a type-stable way; there is the function Base.uabs which seems perfect. Unfortunately, it is not part of the Integer API, so packages like SaferIntegers.jl don't support it. But using ~ should work for any integer.

force_positive(x) = (signbit(x), signbit(x) ? ~x : x)
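A usage sketch of this helper: note that for negative x it returns ~x, i.e. -x - 1, rather than the absolute value, but that payload together with the sign flag still identifies x uniquely, and it avoids the abs(typemin(T)) overflow while staying type-stable:

```julia
# Type-stable sign extraction: returns (sign flag, non-negative payload).
# For negative x, ~x == -x - 1 is used instead of abs(x), which avoids
# the abs(typemin(T)) overflow and keeps the result in the same type T.
force_positive(x) = (signbit(x), signbit(x) ? ~x : x)

force_positive(Int8(3))        # (false, 3)
force_positive(Int8(-1))       # (true, 0), since ~Int8(-1) == 0
force_positive(typemin(Int8))  # (true, 127): no overflow, unlike abs
```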

@rfourquet
Member Author

The current version of make_seed can never start with two UInt32(0) in a row. This could be used as a way to mark that an integer is negative.

Yes, that's a valid alternative I think, and I also mentioned other solutions above which don't affect existing uses of seed!.
I chose the compromise here where most seeds are not "broken", and where only one bit is used to encode the sign. I'm open to something else if people argue against it.

there is the function Base.uabs which seems perfect.

Oh, I didn't know about it! Yes, that, or even better your force_positive function, would be a good alternative to abs plus unsigned. Although to be honest, I'm not really concerned with the type instability caused by this conditional use of unsigned.

@PallHaraldsson
Contributor

probably many people tried something like seed = rand(Int); seed!(rng, seed) and saw it failing.

Maybe that's a good thing, since this is not (more) random at all.

@rfourquet
Member Author

rfourquet commented Sep 21, 2023

Maybe that's a good thing, since this is not (more) random at all.

For seeding you need to start somewhere! There are valid cases for seeding an RNG from the output of another RNG, be it RandomDevice: seed = rand(RandomDevice(), Int); seed!(rng, seed). But this is anyway not the point of this PR; you can already do that with positive seeds, and allowing negative seeds won't change that.

@nhz2
Contributor

nhz2 commented Sep 21, 2023

I think it is better to not try and compromise on whether this affects existing uses of seeding. Either it does or it doesn't. If it doesn't then any code depending on the exact output of a seeded RNG will continue to work.

@nhz2
Contributor

nhz2 commented Sep 21, 2023

Also looking at other RNG libraries StableRNGs.jl and Random123.jl, neither handles negative seeds: StableRNGs.jl will error, and Random123.jl uses seed % UInt32, seed % UInt64, and seed % UInt128.
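As an illustration of what the `seed % UInt32` style of seeding does (this is just the truncation semantics of `%`, not Random123.jl's actual code):

```julia
# `seed % UInt32` keeps only the low 32 bits of the seed, so distinct
# seeds that agree on those bits collide:
@assert (-1) % UInt32 == 0xffffffff               # negative seeds "work"...
@assert (Int64(1) << 40) % UInt32 == 0x00000000   # ...but high bits are dropped
@assert Int64(0) % UInt32 == (Int64(1) << 40) % UInt32  # two seeds, one stream
```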

@Seelengrab
Contributor

If it doesn't then any code depending on the exact output of a seeded RNG will continue to work.

As documented, the RNG stream is allowed to change in any version. It being bothersome to check for exact numeric output is the entire point; RNG is not a stable thing, and shouldn't be used for testing in that manner. If there's code that relies on an exact stream of random numbers for correctness, that's either malicious or a bug waiting to happen.

@nhz2
Copy link
Contributor

nhz2 commented Sep 21, 2023

Yes, the RNG stream is allowed to change for "algorithmic improvements", but I don't think allowing negative seeds is improving any RNG algorithms.

@Seelengrab
Copy link
Contributor

No. The RNG stream is allowed to change for pretty much any reason, not just "algorithmic improvements". It's not guaranteed to stay the same; only the distribution is.

@rfourquet
Copy link
Member Author

Yes, there is even an issue (#37165) suggesting to change the RNG streams in every minor Julia version!

Also looking at other RNG libraries StableRNGs.jl and Random123.jl, neither handles negative seeds:

I don't see that as an argument against this PR though. StableRNGs can easily be updated to accept negative seeds, and I guess Random123 as well.

Random123.jl uses seed % UInt32, seed % UInt64, and seed % UInt128.

I didn't find this code, but I hope it can be changed; this looks like simply discarding bits of the input 😵
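For the record, truncating with `%` keeps only the low bits, so seeds that differ only in their high bits collide:

```julia
# `seed % UInt32` keeps only the low 32 bits of the seed, so these two
# distinct seeds would initialize the generator identically
a = (UInt64(1) << 40) + 7
b = UInt64(7)
@assert a != b
@assert a % UInt32 == b % UInt32 == UInt32(7)
```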

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 22, 2023

For seeding you need to start somewhere! there are valid cases for seeding an RNG from the output of another RNG, be it RandomDevice: seed = rand(RandomDevice(), Int); seed!(rng, seed).

What are the valid reasons? I realize the RNG isn't a cryptographic RNG, but in case people think it is, this is insecure. At least on 32-bit platforms, where Int is Int32, this seems very brute-forceable. Maybe even on 64-bit?

So, maybe it's a good thing that this fails with 50% probability. And even if this has some non-crypto uses, I'm still curious what they are.

If you just need to start somewhere, then you already do, and you might want to just obtain that seed? How is that done? I guessed Random.seed!().seed, but it wasn't that.


The RNG stream is allowed to change for pretty much any reason, not just "algorithmic improvements".

I suggested the new RNG, so I'm hoping it's good for all time, with no need for "algorithmic improvements"; but while we do not promise stability, isn't it very unlikely we change for any other reason? I did see the intriguing linked proposal (which predates the new RNG, added in 1.7 in November 2021) to change the streams in each minor version; I assume that means for each updated y (and x, but NOT z) in x.y.z, to wean people off the incorrect assumption.

If you obtain a new seed, is that just to be able to do it repeatedly in a session (without having to restart Julia)? Or also to store that seed for reproducibility later?

If you really want to do this, it seemed to me you would simply zero the sign bit; I don't understand why the code is this complex, nor have I read the long discussion on it...

Still reviewing a bit I see:

function make_seed(n::Integer)

# inverse of make_seed(::Integer)
from_seed(a::Vector{UInt32})::BigInt = sum(a[i] * big(2)^(32*(i-1)) for i in 1:length(a))
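That inverse reads the vector as little-endian base-2^32 digits; a quick sanity check of the quoted definition:

```julia
# sanity check of the quoted inverse: each word is one base-2^32 digit,
# least significant first
from_seed(a::Vector{UInt32})::BigInt = sum(a[i] * big(2)^(32*(i-1)) for i in 1:length(a))

@assert from_seed(UInt32[0x00000002, 0x00000001]) == 2 + big(2)^32
@assert from_seed(UInt32[0xffffffff]) == typemax(UInt32)
```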

It seemed to me excessive to allow more than Int128 to make a seed; do we actually want or need to allow e.g. BigInt there, or non-builtin Integer types?

[And I didn't know before of BigInt being used anywhere by Julia itself. I would like to know how hypothetically possible it would be to excise the RNG (and also SHA, which it depends on), as well as BigInt (and BigFloat and Rational; LinearAlgebra and OpenBLAS too actually, for BLIS.jl). Basically nobody should be using BigFloat (ArbNumerics.jl is better), and BigInt could stay but could be bundled as a stdlib, not in the sysimage. All breaking changes, I realize... I thought at least the new RNG was independent of BigInt (and of SHA, which seems to be our addition).]

@rfourquet
Copy link
Member Author

What are the valid reasons? I realize the RNG isn't a cryptographic RNG, but in case people think it is, then this is insecure. At least on 32-bit Int is Int32, so this seems very brute-forcable. On 64-bit, maybe even there?

Seeding your RNG with an integer is not secure if you don't want your stream to be brute-forceable. Not all uses of RNGs require this property though. For example, in a package's tests it's fine to seed an RNG with (small) integers;
or say you run a simulation: it's totally valid to use e.g. Xoshiro(rand(RandomDevice(), UInt)), or Xoshiro(rand(RandomDevice(), UInt128)) if you want more certainty that you will never initialize the RNG with the same seed. Xoshiro.(rand(RandomDevice(), Int, 16)) would even be fine if you need 16 different tasks, each having its dedicated local RNG.

Given that seeds go through a SHA algorithm, I would even be confident using the global RNG instead of RandomDevice, as in Xoshiro.(rand(Int, 16))! Although I will not recommend that to others because I'm not an expert, and I believe it's generally not recommended to seed an RNG with the output of another RNG of the same type; but it's fine to seed an RNG with the output of another RNG which has at least the same "quality" (see e.g. https://rust-random.github.io/rand/rand_core/trait.SeedableRng.html#method.from_rng).
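Concretely, the seeding-from-entropy pattern described above can be written with only documented Random API:

```julia
using Random

# draw a one-off seed from OS entropy and keep it around, so that a
# particular run can be replayed later from the recorded seed
dev = RandomDevice()
seed = rand(dev, UInt128)
rng = Xoshiro(seed)
x = rand(rng)
@assert rand(Xoshiro(seed)) == x  # same seed, same first draw
```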

So, maybe it's a good thing that will fail with 50% probability.

I don't think this is the proper way to alert people to good practices. A user will not think "oh, this is a hint that I'm doing something dangerous", but may just wonder "why is Random not able to accept negative seeds? This must be so trivial to implement".

If you just need to start somewhere, then you already do, and you might want to just obtain that seed? How is that done? I guess but it wasn't Random.seed!().seed

With "you need to start somewhere", I meant that a PRNG needs to get entropy from somewhere when it's being seeded. Either you provide a hardcoded seed, or you get a random seed from "somewhere", which can very well be rand(rng, Int) or rand(RandomDevice(), Int128), depending on your needs.

If you obtain a new seed, is that just to be able to do it repeatedly in a session (without having to restart Julia)? Or also to store that seed for reproducibility later?

An example is in Julia's test suite, where a seed is selected at random with seed = rand(RandomDevice(), UInt128) (see the test/choosetests.jl file); if the tests go wrong, the seed is printed, so that you have some chance of reproducing the failing test locally, provided you use the same Julia version (possibly the same commit).

If you really want to do this, it seemed you would simply zero the sign bit,

I won't duplicate the discussion here, but in short, it's so that two distinct seeds lead to two distinct streams of random numbers. Zeroing out the sign bit wouldn't achieve this property, which I want to maintain for make_seed.
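To see why simply clearing the sign bit is not injective (a hypothetical zero_sign helper, not code from this PR):

```julia
# clearing the sign bit maps two distinct seeds onto the same value,
# hence onto the same random stream
zero_sign(n::Int64) = n & typemax(Int64)

a = Int64(42)
b = Int64(42) | typemin(Int64)  # same low bits, sign bit set: a large negative number
@assert a != b
@assert zero_sign(a) == zero_sign(b) == 42
```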

It seemed to me excessive to allow more than Int128 to make a seed, do we actually want or need to allow e.g. BigInt there, or non-builtin Integer types?

We already allow BigInt, so it would be breaking to not allow it anymore (though maybe no one really uses that feature, so it might be considered a minor breaking change), but I see no reason to artificially disallow BigInt.

[And I guess supporting BigInt in make_seed doesn't change anything concerning the excision of BigInt: make_seed is a generic function which doesn't rely on BigInt being available].

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 22, 2023

We already allow BigInt, so it would be breaking to not allow them anymore (though maybe no-one really uses that feature, so might be considered as a minor-breaking change), but I see no reason to artificially disallow BigInt.

I'm not proposing disallowing it (now) for no good reason. I guess it was included for completeness, but it actually conflicts with your other goal because of the pigeonhole principle (in case people exploit BigInt to its full extent, not just numbers easily represented in Int or Int128):

in short it's such that two distinct seeds lead to two distinct streams of random numbers.

I don't really need to know why that's important, I think I can see it, but it seems less needed at least when a random number is used as a seed. [I note you can't generate rand(BigInt), for a good reason, and I just didn't think it would be used for seed! either; nobody would use both in the same sentence. There are other non-built-in Integer types people might have had in mind wanting to cover, e.g. SaferIntegers would apply (all could cast, or SaferIntegers would have to know of seed!...), and you can't write a union of types you don't know. I see no way to limit to all Integers except BigInt, or to at most some size. It's not a huge issue to depend on BigInt until it's excised; I just brought this up since I would want to do that eventually.]

So, maybe it's a good thing that will fail with 50% probability.

I don't think this is the proper way to alert people of good practices.

Well, you're right; but you get an error for a negative seed, so that alerts the user, and it's also documented (I assumed too soon: it's not in the help text and should be, though maybe it is elsewhere). The rand issue doesn't apply to test cases(?), and a non-random seed shouldn't even be used. It just seems to me you're spending a non-trivial amount of time to solve a perceived (and avoidable) problem that many don't have. Thanks for doing it if it's real.

@rfourquet
Copy link
Member Author

rfourquet commented Sep 22, 2023

It just seems to me you're spending a non-trivial amount of time to solve a perceived (and avoidable) problem that many don't have. Thanks for doing it if it's real.

Thank you, yeah, I've never bothered much about negative seeds, but thought about allowing them once in a while when I got an error like the one mentioned above. The only reason I went on implementing them now was that I was reviewing open PRs with the RNG label to triage them a bit, and got reminded of #46190; I could have just closed it right away, but for some reason decided to instead spend an hour or so implementing the feature in the way I had laid out there. The non-trivial amount of time is mostly spent on the review process now.

There are other non-built-in Integer types (potentially) people might have had in mind wanting to cover, and you can't write a union of types you don't know

Indeed, and since we do things in a generic way here, I see no point in artificially restricting the allowed input types.

it's not in the help text and should be, maybe it is elsewhere

see the help for MersenneTwister; also, this PR adds docs for Xoshiro, and I have another open PR

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 22, 2023

I think it should be under (what people use directly?):

help?> Random.seed!

[Also, should it be exported? It isn't (it seems that help, at least, should look into stdlibs even if not exported, also before using), and you write code as if it is... seed = rand(Int); seed!(rng, seed)]

And I guess it should be documented for Xoshiro too, which people don't use directly and might not even know exists... or of course you can merge this PR when done...

@Seelengrab
Copy link
Contributor

Seelengrab commented Sep 22, 2023

Well, I see that we're never going to agree on this. I guess I'll remove myself from the discussion and leave with the note that I think departing from existing RNG literature (which only cares about bit patterns, as far as I'm aware) seems like an avoidable mistake 🤷 It's odd to me that only part of a type matters to you (i.e., unsignedness matters, but the specific UInt8-ness does not).

@@ -21,6 +21,13 @@ multiple interleaved xoshiro instances).
The virtual PRNGs are discarded once the bulk request has been serviced (and should cause
no heap allocations).

The `seed` may be an integer or a vector of `UInt32` integers.
Copy link
Member Author

@rfourquet rfourquet Sep 22, 2023


I'm actually a bit undecided about committing yet to supporting Vector{UInt32} as seeds. This support is perhaps useful because MersenneTwister (and so the old GLOBAL_RNG) accepted these seeds. So we might want to silently support that for TaskLocalRNG() only. I have some more ideas on how to iterate on the seeding machinery, so I'm inclined to postpone this documentation bit.

Suggested change
The `seed` may be an integer or a vector of `UInt32` integers.

TODO: add allowed seeds as a type annotation in Xoshiro(seed).

Copy link
Contributor


I like to use a vector of 4 random UInt64 for seeding TaskLocalRNG(). This nicely matches the size of the Xoshiro State and the output of sha256.

Copy link
Member Author


Ah, perfect! So my idea was that low-entropy seeds like integers should ultimately go through SHA. But we could consider that when you pass an array of 4 UInt64 (or 8 UInt32) words, these 256 bits already have sufficient entropy and can be used directly to initialize the state. So my idea was to allow any array of integer bit types, but only of the correct size (256 bits in total), and to skip SHA in this case. Tuples of bit integers could also be accepted. What do you think?
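The size check for that idea could be as simple as the following sketch (a proposal, not actual Random behavior; is_direct_seed is a hypothetical name, and Base.BitInteger is an internal union type):

```julia
# proposal sketch: an array of bits-type integers totaling exactly 256 bits
# could bypass SHA and fill the Xoshiro state directly
is_direct_seed(a::AbstractArray{<:Base.BitInteger}) = 8 * sizeof(a) == 256

@assert is_direct_seed(zeros(UInt64, 4))
@assert is_direct_seed(zeros(UInt32, 8))
@assert !is_direct_seed(zeros(UInt64, 2))
```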

Copy link
Contributor


I am fine with doing an extra SHA, because it may be hard to explain the concept of sufficient entropy in the docs of seed!. Also, it would be nice to support seeding from an AbstractVector{UInt8}.
Currently, to seed the RNG from a user-provided string, I do the following:

Random.seed!(collect(reinterpret(UInt64, sha256("seed1"))))

But I think it would be much nicer if the following worked.

Random.seed!(codeunits("seed1"))

Copy link
Member Author


I have a local branch actually implementing seed!("the seed")! But I didn't push it yet because there are enough PRs in flight.

@rfourquet
Copy link
Member Author

I think it should be under (what people use directly?): seed!

Yes I added a stub to seed!'s doc here.

departing from existing RNG literature (which only cares about bitpatterns, as far as I'm aware) seems like an avoidable mistake to me

Yes, thank you for the time and energy spent on your review, but I agree that we have reached a point where it seems clear we won't agree :)
Given that I don't see this direction as a mistake, I will probably go forward with it, while giving a bit more time for potential other reviews. But I won't document the properties (P1 and P2) that are implemented here just yet, so if this turns out to actually be a mistake, we can always roll back and use another implementation more in line with your approach.

@nhz2
Copy link
Contributor

nhz2 commented Sep 22, 2023

I agree that properties P1 and P2 make the seeding process much easier to understand.

Yes there is even an issue #37165 suggesting to change the RNG streams in every minor julia version!

This might be a nice idea in theory, but in practice it could be mildly annoying. For example, I have a large number of tests that should pass for most random streams, but may sometimes fail if I get very unlucky; by Murphy's law, I think someone using Julia will have some test fail every time the RNG stream changes, even if the probability of any individual test failing is less than 0.1%. Looking at https://juliahub.com/ui/Search?q=Random.seed%21%280x&type=code, there are a bunch of package tests that could be affected by this change.

I understand that the RNGs in Random may change across minor Julia versions, and so far all of the changes have been great and worth any mild annoyances; but in this case I'm not so sure, when there is a way to implement this functionality without affecting any existing uses.

@Seelengrab
Copy link
Contributor

Seelengrab commented Sep 23, 2023

For example, I have a large number of tests that should pass for most random streams, but may sometimes fail if I get very unlucky

A test that can fail if you're unlucky indicates (to me at least) either a buggy test, a buggy piece of code, or both in the worst case, when a buggy test suite hides a buggy piece of code and gives the illusion of passing. Pretty much all of those 12 packages from your link should either hardcode the specific test data they rely on in their tests, or use a testing strategy that doesn't rely on seeding an RNG to a specific value. Quite a few of the packages have also already migrated away from seeding in their newest version, presumably due to the last time the RNG stream changed.

@andrewjradcliffe
Copy link
Contributor

@rfourquet Vigna's papers are accessible from his web site.

@rfourquet
Copy link
Member Author

Regarding changed random streams: I will merge this implementation, in particular given that another change will probably either restore them or break them completely by the time we do a feature freeze for 1.11.

@rfourquet rfourquet merged commit 8ab635d into master Sep 29, 2023
4 of 6 checks passed
@rfourquet rfourquet deleted the rf/negative-seeds branch September 29, 2023 10:53