Random: allow negative seeds #51416

Merged · 2 commits · Sep 29, 2023
Random: allow negative seeds
Alternative to #46190, see that PR for background.
There isn't a strong use case for accepting negative seeds, but many people have
probably tried something like `seed = rand(Int); seed!(rng, seed)` and seen it fail.
As it's easy to support, let's do it.

This might "break" some random streams, but only those for which the upper bit of
`make_seed(seed)[end]` was set, so breakage should be rare.
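The motivating pattern can be sketched as follows (a sketch, assuming Julia 1.11 with this change applied; the use of `Xoshiro` is illustrative):

```julia
using Random

# Previously, a negative integer seed threw an error; with this change,
# the common pattern below just works.
rng = Xoshiro()
seed = rand(Int)          # may well be negative
Random.seed!(rng, seed)   # no longer errors when seed < 0
Random.seed!(rng, -1)     # an explicit negative seed is fine too
```

Reseeding with the same negative seed reproduces the same stream, as for non-negative seeds.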
rfourquet committed Sep 29, 2023
commit 8c3ba6f6a985dec863844e5dfd5da0ecc5f81f18
2 changes: 2 additions & 0 deletions NEWS.md
@@ -51,6 +51,8 @@ Standard library changes

#### Random

* When seeding RNGs provided by `Random`, negative integer seeds can now be used ([#51416]).

#### REPL

* Tab complete hints now show in lighter text while typing in the repl. To disable
55 changes: 43 additions & 12 deletions stdlib/Random/src/RNGs.jl
@@ -83,12 +83,12 @@ MersenneTwister(seed::Vector{UInt32}, state::DSFMT_state) =
Create a `MersenneTwister` RNG object. Different RNG objects can have
their own seeds, which may be useful for generating different streams
of random numbers.
-The `seed` may be a non-negative integer or a vector of
-`UInt32` integers. If no seed is provided, a randomly generated one
-is created (using entropy from the system).
-See the [`seed!`](@ref) function for reseeding an already existing
-`MersenneTwister` object.
+The `seed` may be an integer or a vector of `UInt32` integers.
+If no seed is provided, a randomly generated one is created (using entropy from the system).
+See the [`seed!`](@ref) function for reseeding an already existing `MersenneTwister` object.

!!! compat "Julia 1.11"
    Passing a negative integer seed requires at least Julia 1.11.

# Examples
```jldoctest
@@ -290,20 +290,51 @@ function make_seed()
end
end

"""
make_seed(n::Integer) -> Vector{UInt32}

Transform `n` into a bit pattern encoded as a `Vector{UInt32}`, suitable for
RNG seeding routines.

`make_seed` is "injective": if `n != m`, then `make_seed(n) != make_seed(m)`.
Moreover, if `n == m`, then `make_seed(n) == make_seed(m)`.

This is an internal function, subject to change.
"""
 function make_seed(n::Integer)
-    n < 0 && throw(DomainError(n, "`n` must be non-negative."))
+    neg = signbit(n)
+    n = abs(n) # n can still be negative, e.g. n == typemin(Int)
+    if n < 0
+        # we assume that `unsigned` can be called on integers `n` for which `abs(n)` is
+        # negative; `unsigned` is necessary for `n & 0xffffffff` below, which would
+        # otherwise propagate the sign bit of `n` for types smaller than UInt32
+        n = unsigned(n)
Contributor:

Does this introduce a type instability?

Member Author:

Yes, but this should be handled fine by union-splitting.

+    end
     seed = UInt32[]
-    while true
+    # we directly encode the bit pattern of `abs(n)` into the resulting vector `seed`;
+    # to greatly limit breaking the streams of random numbers, we encode the sign bit
Contributor (@Seelengrab, Sep 21, 2023):

The stream of random numbers is documented not to be stable across versions, even for the same seed. This seems like a lot of additional code to avoid breaking something we don't guarantee in the first place.

Member Author:

We can definitely break the random streams, but it can then be some work to fix the seeds in LinearAlgebra! And probably for some people in the wild. But I don't think that not breaking most seeds required much more code; the alternatives I saw were:

  1. storing the sign bit as the first bit of seed[0], which would require shifting
  2. storing the sign bit in a dedicated UInt32, e.g. in seed[1] or seed[end] (probably the simplest option)
  3. storing a hard coded string of random UInt32 numbers at the beginning of seed! to encode negativity (e.g. UInt32[0xbc7f82b8, 0x53f967d9, 0x538ab2e4, 0x7c8725a9]); this would also have been "non-stream-breaking"

Contributor:

As discussed in the other PR, I still don't understand why the sign bit needs to be treated specially at all. It's a bit. make_seed certainly doesn't make use of the fact that what's passed in is a mathematical number; it only cares for the bits. Why is this one bit special to the purpose of make_seed?

Member Author:

I already explained at length my reasoning in the other PR, I'm afraid I don't have much more ideas on how to explain this.

A Julia integer can be seen as a bag of bits, but it's first and foremost an object representing a mathematical number. I simply want (1) n == m implies make_seed(n) == make_seed(m), and (2) n != m implies make_seed(n) != make_seed(m), without caring about the bits which encode n or m. I just need to care about the bits (in particular the sign bit) in the implementation, which must simply do what it takes to honor these properties.

Property (1) is what we expect from mathematical functions, and property (2) is just useful for seed!: after all we have the following in the docstring for MersenneTwister:

Different RNG objects can have their own seeds, which may be useful for generating different streams
of random numbers

Yes, with different seeds, we expect different streams.

Contributor:

My objection is with the interpretation that make_seed takes a mathematical integer and maps that to a stream of numbers. It doesn't - it takes the given bag-of-bits and initializes the RNG state with that. A Julia Int(8/16/32/64/128) (which are really from a modular ring, not the entire integers) is just the most convenient & packed representation of that we have.

This is necessarily an operation that can't produce unique RNG objects for every possible seed, by the pigeonhole principle: the state of the PRNG is of finite size. You can of course say that you only really care about finitely sized inputs (so Base.BitIntegers, basically), but then you also have to prove that those different inputs not only produce different RNG initial streams, but that the stream of numbers (and the associated states) doesn't overlap (otherwise you're on a shared cycle). This is already not the case for any RNG based on LFSR, which merely has long cycle lengths.

So, with that in mind, I don't understand why those two properties in particular are desirable, other than to give the illusion of uncorrelated RNG streams. I get that the properties seem nice, but you don't really get much out of them algorithmically. Documenting that make_seed by default cares only about the bits of whatever object its given seems just as simple to document & easier to maintain.

Member Author:

Documenting that make_seed by default cares only about the bits of whatever object its given seems just as simple to document & easier to maintain.

I agree it might be easier to maintain, but not by much as this PR shows. But also, again, I see no particular reason to treat integers as bag of bits: it's only an implementation detail of make_seed. Mostly everywhere else, we treat julia integers as mathematical integers as much as reasonably possible. hash for example does that.

Besides, by just applying unsigned to the input like in #46190, we still have to handle BigInt separately, and this wouldn't even address the case of other non-bits integers in the wild. Here make_seed will work for any integer implementation providing a view of their instances as two's complement (which BigInt does).

There are two steps in the process:

  1. map integers to vector seeds suitable to initialize RNGs
  2. map vectors from 1) into random streams

If the two properties (P1) mathematical function and (P2) injectivity, i.e. x == y iff f(x) == f(y), hold for each step, then it holds for the whole process.
This PR addresses only 1), and implements P1 and P2.

For 2), (P1) is already generally implemented: for a given initialization vector, the RNGs implemented by Random are initialized into a deterministic state.
For (P2), of course, as already mentioned in the other PR, true injectivity can't hold, but we get it in practice by using SHA2_256.

but then you also have to prove that those different inputs not only produce different RNG initial streams, but that the stream of numbers (and the associated states) doesn't overlap

This is a good point, that users should be aware of; though I don't have to prove anything, I'm not trying to solve this here, only trying to do the best possible thing for seeding. To be certain to avoid overlapping streams, the only way I know of (for the RNGs we have) would be to use "jumping", which is already implemented for MersenneTwister, and there is an open PR for Xoshiro. For users not using jumps, I think the best we can do is to make seed!(::Integer) produce different streams for different inputs (and the same streams for == inputs). In practice, two distinct streams will be extremely unlikely to overlap during program execution. For example, if I'm not mistaken, this paper (I could only read the intro, but it gives the formula) suggests that for n processors each producing sequences of length L for a state with period P, "the probability of overlap is always at most n^2 * L / P". For example, setting n = big(10.0)^4; L = big(2.0)^50; P = big(2.0)^256 (Xoshiro has period 2^256-1), this formula gives a probability of getting overlapping sequences upper bounded by 2^(-179), i.e. were we to trust this paper, with these parameters, it's more likely to find a collision in SHA2_256 than to get overlapping sequences with Xoshiro.
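The quoted bound is easy to check directly (a sketch, using exactly the parameter values above):

```julia
# Overlap-probability bound: at most n^2 * L / P for n independently seeded
# streams of length L from an RNG with period P (formula quoted above).
n = big(10.0)^4   # number of seeded streams
L = big(2.0)^50   # numbers drawn per stream
P = big(2.0)^256  # period of Xoshiro256 (actually 2^256 - 1)
bound = n^2 * L / P
log2(bound)       # ≈ -179.4, i.e. the bound is below 2^-179
```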

I don't understand why those two properties in particular are desirable, other than to give the illusion of uncorrelated RNG streams

It's maybe an illusion in theory, but with the arguments above, i believe that for all practical matters, these properties go a long way into ensuring uncorrelated streams.

Contributor:

Mostly everywhere else, we treat julia integers as mathematical integers as much as reasonably possible.

No; Julia integers are not mathematical integers (i.e., the set of natural numbers, if that's what you mean). They are modular integers. make_seed ignores that in any case, and is only interested in the "bag of bits" behavior of them - otherwise operations like & 0xffffffff wouldn't make sense in the first place, and would have to be written like % 0x100000000 instead.

Besides, by just applying unsigned to the input like in #46190, we still have to handle BigInt separately, and this wouldn't even address the case of other non-bits integers in the wild. Here make_seed will work for any integer implementation providing a view of their instances as two's complement (which BigInt does).

Naive conversion via unsigned still propagates the sign bit:

julia> reinterpret(Int8, 0xff) |> unsigned
0xff # _not_ 0xef!

which is exactly why I chose it in the original PR, in order to not have to take special considerations. So by special casing signbit, you've introduced a bias where the sign bit is counted twice if it's set (once during the conversion to unsigned, then again in your if neg), meaning it contributes more to the seeded value than the other bits. That's why it's dangerous to do this kind of special casing, and why it should be avoided at all costs.

There are two steps in the process:

  1. map integers to vector seeds suitable to initialize RNGs

"suitable" is doing a lot of heavy lifting here. Why is the direct interpretation of "bag of bits" not suitable - other than losing numerical interpretation, which make_seed has no history of making in the first place?

  1. map vectors from 1) into random streams
    If the two properties (P1) mathematical function and (P2) injectivity, i.e. x == y iff f(x) == f(y), hold for each step, then it holds for the whole process.

Could you expand on what you mean with "(P1) mathematical function"? That is not a testable property of a piece of code, as far as I'm aware. What criterion makes a function "mathematical"?

For (P2), of course, as already mentioned in the other PR, true injectivity can't hold, but we get it in practice by using SHA2_256.

SHA cannot make a non-injective input injective overall. The failure of injectivity already happened before SHA ever saw any input; it cannot "disentangle" that without additional information, which is not provided as far as I can tell.

For users not using jumps, I think the best we can do is to make seed!(::Integer) produce different streams for different inputs (and the same streams for == inputs).

Yes, I agree with this. But I don't agree with "different inputs" meaning "different numerical interpretations of a string of bits" for the purpose of seeding an RNG. make_seed and RNGs in general do not care about numerical interpretation of the input seed; they only care about the raw bits, without interpretation.

To be certain to avoid overlapping streams, the only way I know of (for the RNGs we have) would be to use "jumping", which is already implemented for MersenneTwister, and there is an open PR for Xoshiro. For users not using jumps, I think the best we can do is to make seed!(::Integer) produce different streams for different inputs (and the same streams for == inputs).

I don't follow here. This is not about jumping or multiprocessing at all - this is about what make_seed interprets its input as. I'm arguing that it should only care about the bits, not the type/domain. You're arguing for the opposite. Any existing user will be wholly unaffected by this (or the other PR), because the existing working inputs don't suddenly overlap with existing working inputs. It's only non-working inputs that may overlap with existing working inputs (which is, again, inevitable due to the nature of finite space).

Further, any user relying on task-based multithreading in julia will (or does) already benefit from jumping, both in MT as well as Xoshiro; the tasks are already seeded with a jumped value from the parent RNG.

It's maybe an illusion in theory, but with the arguments above, i believe that for all practical matters, these properties go a long way into ensuring uncorrelated streams.

No, it's also an illusion in practice, and one we should document & make visible, so that people can avoid shooting themselves in the foot by accident, by encouraging use of jumping and heavily discouraging seeding multiple RNGs. The correct procedure for getting multiple statistically uncorrelated RNG streams is to seed one parent RNG and create multiple sub-RNGs by jumping. Seeding multiple RNGs and hoping for uncorrelation is not sound.

Member Author:

make_seed ignores that in any case, and is only interested in the "bag of bits" behavior of them - otherwise operations like & 0xffffffff wouldn't make sense in the first place, and would have to be written like % 0x100000000 instead.

I don't understand what makes you say that make_seed is only interested in the bag of bits. The current implementation can lead to both interpretations. the fact that we use bitwise operations is just an implementation detail: they are the tools we have to dissect and assemble integers; I again concede that for that we need to make the assumption that these integer types expose a two's complement "interface", i.e. bitwise operation.

Naive conversion via unsigned still propagates the sign bit:

I don't understand this point.

you've introduced a bias where the sign bit is counted twice if it's set (once during the conversion to unsigned,

I believe not: one way to see that is that, unless of course the implementation in this PR is faulty, it implements the following: x == y iff f(x) == f(y). Now consider the path taken for x::BigInt: abs(x) can't be < 0, and no unsigned call is involved. Therefore, for BigInt, the sign bit is not counted twice: it's removed from x when taking abs, and is "transferred" into neg, which is then used to set one bit in the resulting array. Now for any non-BigInt y equal to x, we have make_seed(x) == make_seed(y), so the sign bit of y is not counted twice.

then again in your if neg), meaning it contributes more to the seeded value than the other bits. That's why it's dangerous to do this kind of special casing, and why it should be avoided at all costs.

But even then, even if the sign bit was counted twice, for whatever that means, I fail to see what problem this would lead to.

Why is the direct interpretation of "bag of bits" not suitable - other than losing numerical interpretation,

I think it's suitable to a great extent, but I like the interpretation as "math integers" better here, as for example seed(-1) will lead to the same stream on 32 vs 64 bits platforms.

which make_seed has no history of making in the first place?

As an anecdote, make_seed was initially accepting negative numbers (67f6fee), but was changed shortly after (3c5946c). There is not a lot of information in these commits, but I would guess the change was precisely to get consistent streams across 32- and 64-bit platforms (but I could be totally wrong there).

Could you expand on what you mean with "(P1) mathematical function"? That is not a testable property of a piece of code, as far as I'm aware. What criterion makes a function "mathematical"?

Sorry I was too terse I guess. The property x == y iff f(x) == f(y) contains
(P1) "mathematical function", i.e. x == y implies f(x) == f(y)
(P2) injectivity, i.e. f(x) == f(y) implies x == y

So rand(x::Integer) is definitely not a "mathematical function", but hash(x) is. Likewise, I want make_seed to be mathematical.
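The distinction can be checked directly (a sketch; the `hash` behavior matches what is claimed later in this thread):

```julia
# P1 ("mathematical function"): x == y implies f(x) == f(y).
# `hash` satisfies P1 across numerically equal integers of different types:
hash(Int64(-1)) == hash(Int32(-1)) == hash(big(-1))  # true

# `rand` does not satisfy P1: two calls with an equal argument
# (almost surely) give different results.
a, b = rand(1), rand(1)
a == b  # false, with overwhelming probability
```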

SHA cannot make a non-injective input injective overall. The failure of injectivity already happened before SHA ever saw any input; it cannot "disentangle" that without additional information, which is not provided as far as I can tell.

I don't get this. The pipeline being x |> make_seed |> SHA, if SHA is injective and if make_seed is injective, then SHA ∘ make_seed, is injective, i.e. seed! essentially.

I'm arguing that it should only care about the bits, not the type/domain

Yes and I don't see any advantage to that, except slightly simpler code if we didn't bother to accept BigInt; but BigInt must be supported.

It's only non-working inputs that may overlap with existing working inputs (which is, again, inevitable due to the nature of finite space).

But the big difference is that with the other PR using unsigned, it's trivial to construct two distinct (!=) seeds which produce the same streams, whereas the solution proposed here makes this practically impossible.

No, it's also an illusion in practice, and one we should document & make visible, so that people can avoid shooting themselves in the foot by accident, by encouraging use of jumping and heavily discouraging seeding multiple RNGs

I'm all for making jumping more visible, and finding it useful, I even implemented the jumping code for julia's MersenneTwister. But at the same time, users will sometimes just not care about it, because seeding a bunch of integers just works in practice, e.g. Xoshiro.((1, 2, 3, 4)). I'm not an expert on the subject, but I tried to provide some justification for why the 4 resulting Xoshiro instances have such an astronomically small chance of running into overlapping sequences that it's impossible in practice. In other words, barring bugs, I believe no-one will ever run into overlapping sequences by accident when using distinct integer seeds. On the other hand, it's very easy to run into the same sequence from two distinct seeds if we interpret them as bags of bits.

Contributor:

Now for any non-BigInt y equal to x, we have make_seed(x) == make_seed(y), so the sign bit of y is not counted twice.

It is counted twice:

julia> make_seed(0x80000001)
2-element Vector{UInt32}:
 0x80000001
 0x00000000

Once in propagation of the original data (hence the leading 8), and once when pushing the additional zero! You're "extracting" two bits of information from a single input bit, whereas the other bits only contribute a single bit themselves. That is a bias in favor of that single bit, which ends up contributing 33 bits of "information" to the input of SHA256 (which is always going to be zero, or not exist at all - meaning, it might as well not be there!) There is no additional entropy being added; all you're doing is mapping to a different set of SHA256 outputs due to the Unsignedness (on the type level) of the input.

So rand(x::Integer) is definitely not a "mathematical function", but hash(x) is. Likewise, I want make_seed to be mathematical.

I don't get this. The pipeline being x |> make_seed |> SHA, if SHA is injective and if make_seed is injective, then SHA ∘ make_seed, is injective, i.e. seed! essentially.

Those are both "mathematical functions". Both map an input to an output. What you're describing with P1 and P2 sounds to me like bijectivity, a property some mathematical functions (i.e., functions) have. Since we've already established in the other PR that this function cannot be injective (and thus not bijective) precisely because of BigInt, it seems odd to me to want to cling to this.

I think it's suitable to a great extent, but I like the interpretation as "math integers" better here, as for example seed(-1) will lead to the same stream on 32 vs 64 bits platforms.

Why is that important? We intentionally barely guarantee anything about randomness. Why guarantee this? Why should a user expect this to be the case, when in tons of other contexts (e.g. hashing) we don't provide stableness across architectures at all?

But the big difference is that with the other PR using unsigned, it's trivial to construct two distinct (!=) seeds which produce the same streams, whereas the solution proposed here make this practically impossible.

No, it's just as trivial. Just create two random integers, convert them to BigInt, multiply them and use them as the seed. Because we're going through SHA256 later on, the larger input will be compressed into a fixed length and by the pigeonhole principle, we're pretty much guaranteed to have a collision for sufficiently large BigInt. Supporting BigInt just means throwing any sort of injectivity right out of the window, if we intend to then go through SHA.

In other words, barring bugs, I believe no-one will ever run into overlapping sequences by accident when using distinct integer seeds. On the other hand, it's very easy to run in the same seqence from two distinct seeds if we interpret them as bag of bits.

So you mean to tell me that it's more likely that a user runs into an overlap between 0x80 vs. -1, no matter the interpretation? I think this is a false equivalence; if the input bits are the exact same distribution, both approaches lead to equally likely collisions (again, by pigeonhole). There simply aren't magically more bits; signedness is not special.

Not to mention that due to SHA, "by accident" is also not going to work for the approach in the other PR. Are there bitpatterns for signed integers that overlap with the results from unsigned integers? Yes, but this PR does nothing to prevent that, save for pushing an additional zero(UInt32) when the input is ::Unsigned (which would be easy to add to the other PR as well, which would still also result in biasing).

Member Author:

Once in propagation of the original data (hence the leading 8), and once when pushing the additional zero! You're "extracting" two bits of information from a single input bit, whereas the other bits only contribute a single bit themselves. That is a bias in favor of that single bit, which ends up contributing 33 bits of "information" to the input of SHA256

Oh I think i now see what you mean. But no, when interpreting julia integers as mathematical integers, I'm not counting the sign bit twice: a UInt32 needs 33 bits to be encoded as a math integers: 32 bits for the absolute value given by the UInt32, and one bit for the sign. IOW, in UInt32, the sign bit is implicit and doesn't require an explicit storage bit to store this information, because the information is in the type itself. make_seed is nothing more than an encoder for arbitrary-length integers, in particular the set of possible outputs is isomorphic to the set of BigInts. This transformation is useful only insofar as RNGs don't digest BigInt well, they like to eat arrays of unsigned. When you pass a value of type UInt or UInt32 or BigInt etc into make_seed, the input type is erased, only the mathematical value is taken into account. In particular, the sign of the value 0x80000001 must be encoded in this array representation, if I want to implement properties P1 and P2.

But anyway, I don't actually make much sense of "You're "extracting" two bits of information from a single input bit": for a single bit I can extract at most a single bit of information; the way I encode it doesn't change that. I could encode the sign bit by using 10 bits of storage; these 10 bits would still store one bit of information. And this redundant information wouldn't affect at all the soundness of the integer |> make_seed |> SHA pipeline; SHA doesn't care, it would just give a distinct, random-looking output for distinct inputs, however they are encoded.

Those are both "mathematical functions". Both map an input to an output

By "both", are you refering to the examples I gave, i.e. rand(x) and hash(x) ? rand(x) is not mapping an input to an output:

julia> rand(1)
1-element Vector{Float64}:
 0.8993653769107961

julia> rand(1)
1-element Vector{Float64}:
 0.7678647843242529

For a given input 1, there isn't any fixed output.

What you're describing with P1 and P2 sounds to me like bijectivity

No, P1 is x == y implies f(x) == f(y) which is a property of math functions (and which rand doesn't satisfy), and P2 is injectivity f(x) == f(y) implies x == y. Bijectivity is injectivity + surjectivity. Here I don't specifically bother about surjection, that's why I don't try to have every possible Vector{UInt32} be an encoding of an integer.

Since we've already established in the other PR that this function cannot be injective (and thus not bijective) precisely because of BigInt, it seems odd to me to want to cling to this.

I will repeat for the last time: while make_seed is truly injective, of course, in theory, SHA is ridiculously NOT injective, an infinite number of inputs can map to the same output, but for all practical matter, it can be considered injective.

Why is that important? We intentionally barely guarantee anything about randomness. Why guarantee this?

It's not crucial, but nice to have and trivial to implement. We have seed! for a reason, to reproduce a computation; if I want to reproduce a computation on another machine with a different architecture, it's just nice to be able to have the stream of random numbers be reproducible.

Why should a user expect this to be the case, when in tons of other contexts (e.g. hashing) we don't provide stableness across architectures at all?

AFAIK, hash implements P1, e.g. hash(Int64(-1)) == hash(Int32(-1)) == hash(big(-1)). hash goes to great length to ensure that the inputs are considered as math integers rather than only bag of bits in order to implement P1.

No, it's just as trivial.

It can't be trivial if no one on earth is able to provide an instance of two distinct seeds producing the same stream. It's trivial only to an oracle.

So you mean to tell me that it's more likely that a user runs into an overlap between 0x80 vs. -1, no matter the interpretation?

I'm talking in terms of ==. I'm saying: with the other PR ("bag of bits" interpretation) the distinct seeds Int8(-1) and 0xff will lead to the same streams, and there are many such simple examples. With this PR here ("math interpretation"), it's practically impossible that anyone will ever find two distinct seeds leading to the same streams. But of course, if we had an equality operator for "bag of bits", say =_=, then the same property would hold for the other PR: if x =_= y is false, then the streams will be distinct. But afaik we don't have such an operator in julia. Integers are compared with == which uses the mathematical interpretation.
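The `Int8(-1)` vs `0xff` example can be made concrete (a sketch illustrating the two interpretations discussed above):

```julia
# Bag-of-bits interpretation: Int8(-1) and 0xff have identical bit patterns,
# so a bits-based make_seed would give them identical streams.
reinterpret(UInt8, Int8(-1)) == 0xff  # true: same 8 bits

# Mathematical interpretation (this PR): they are different numbers,
# so they get different seeds and hence different streams.
Int8(-1) == 0xff                      # false: -1 != 255
```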

Yes, but this PR does nothing to prevent that, save for pushing an additional zero(UInt32) when the input is ::Unsigned (which would be easy to add to the other PR as well, which would still also result in biasing).

I don't understand the "save for": encoding the sign bit in the upper bit of seed[end] is exactly designed to prevent that; there are many ways to encode this bit, you just have to choose one. But I don't understand your concern with the biasing thing; there is no biasing happening here. In this PR, every integer is mapped to a determined bit field of size 256 (the output of SHA), which is random-looking and cannot be computationally mapped back to the original integer seed. For whatever input seeds you choose, the distribution of the output 256-bit fields is uniform; there is no bias. And for all practical matters, no two integers will ever map to the same 256 bits, which is nice to have.

+    # as the upper bit of `seed[end]` (i.e. for most positive seeds, `make_seed` returns
+    # the same vector as when we didn't encode the sign bit)
+    while !iszero(n)
         push!(seed, n & 0xffffffff)
-        n >>= 32
-        if n == 0
-            return seed
-        end
+        n >>>= 32
     end
+    if isempty(seed) || !iszero(seed[end] & 0x80000000)
+        push!(seed, zero(UInt32))
+    end
+    if neg
+        seed[end] |= 0x80000000
+    end
+    seed
 end

 # inverse of make_seed(::Integer)
-from_seed(a::Vector{UInt32})::BigInt = sum(a[i] * big(2)^(32*(i-1)) for i in 1:length(a))
+function from_seed(a::Vector{UInt32})::BigInt
+    neg = !iszero(a[end] & 0x80000000)
+    seed = sum((i == length(a) ? a[i] & 0x7fffffff : a[i]) * big(2)^(32*(i-1))
+               for i in 1:length(a))
+    neg ? -seed : seed
+end


#### seed!()
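The intended relationship between the two helpers is that `from_seed` inverts `make_seed` on the integers. A sketch of that round trip, assuming Julia 1.11+ where the internal `Random.make_seed` accepts negative seeds; `from_seed` is reproduced from the diff above so the snippet is self-contained:

```julia
using Random: make_seed  # internal helper, subject to change

# `from_seed` as introduced in this PR, copied here for illustration
function from_seed(a::Vector{UInt32})::BigInt
    neg = !iszero(a[end] & 0x80000000)
    seed = sum((i == length(a) ? a[i] & 0x7fffffff : a[i]) * big(2)^(32*(i-1))
               for i in 1:length(a))
    neg ? -seed : seed
end

# Round trip over positive, negative, and arbitrary-precision seeds:
for n in (0, 1, -1, typemin(Int64), big(2)^100, -big(2)^100)
    @assert from_seed(make_seed(n)) == n
end
```

This is exactly properties P1/P2 from the discussion above: equal seeds encode equally, distinct seeds encode distinctly.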
2 changes: 2 additions & 0 deletions stdlib/Random/src/Random.jl
Original file line number Diff line number Diff line change
Expand Up @@ -394,6 +394,8 @@ sequence of numbers if and only if a `seed` is provided. Some RNGs
don't accept a seed, like `RandomDevice`.
After the call to `seed!`, `rng` is equivalent to a newly created
object initialized with the same seed.
The types of accepted seeds depend on the type of `rng`, but in general,
integer seeds should work.

If `rng` is not specified, it defaults to seeding the state of the
shared task-local generator.
13 changes: 13 additions & 0 deletions stdlib/Random/src/Xoshiro.jl
@@ -21,6 +21,13 @@ multiple interleaved xoshiro instances).
The virtual PRNGs are discarded once the bulk request has been serviced (and should cause
no heap allocations).

The `seed` may be an integer or a vector of `UInt32` integers.
Member Author
@rfourquet rfourquet Sep 22, 2023

I'm actually a bit undecided about committing yet to supporting `Vector{UInt32}` as seeds. This support is perhaps useful because `MersenneTwister` (so the old `GLOBAL_RNG`) accepted these seeds. So we might want to silently support that for `TaskLocalRNG()` only. I have some more ideas on how to iterate on seeding stuff, so I'm inclined to postpone this documentation bit.

Suggested change
The `seed` may be an integer or a vector of `UInt32` integers.

TODO: add allowed seeds as a type annotation in Xoshiro(seed).

Contributor

I like to use a vector of 4 random UInt64 for seeding TaskLocalRNG(). This nicely matches the size of the Xoshiro state and the output of sha256.

Member Author

Ah perfect! So my idea was that low-entropy seeds like integers should ultimately go through SHA. But we could consider that when you pass an array of 4 UInt64 (or 8 UInt32) words, these 256 bits already have sufficient entropy and can be used directly to initialize the state. So my idea was to allow any array of bits-type integers, but only of the correct size (256 bits in total), and to skip SHA in this case. Tuples of bits-type integers could also be accepted. What do you think?

Contributor

I am fine with doing an extra SHA, because it may be hard to explain the concept of sufficient entropy in the docs of `seed!`. Also, it would be nice to support seeding from `AbstractVector{UInt8}`.
Currently, to seed the RNG from a user-provided string I do the following

Random.seed!(collect(reinterpret(UInt64, sha256("seed1"))))

But I think it would be much nicer if the following worked.

Random.seed!(codeunits("seed1"))

Member Author

I have a local branch actually implementing `seed!("the seed")`! But I didn't push it yet because there are enough PRs going.
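The workaround quoted in this thread hashes a string to 256 bits and reinterprets the digest as four `UInt64` words, the same shape as Xoshiro's state. An illustrative Python equivalent (byte order assumed little-endian, matching what `reinterpret` does on common platforms; not the PR's code):

```python
import hashlib
import struct

def seed_words_from_string(s: str) -> tuple[int, ...]:
    # Hash the string to a 32-byte digest, then split it into four
    # little-endian 64-bit words.
    digest = hashlib.sha256(s.encode("utf-8")).digest()
    return struct.unpack("<4Q", digest)

words = seed_words_from_string("seed1")
assert len(words) == 4
assert all(0 <= w < 2**64 for w in words)
# Deterministic: the same string always yields the same seed words,
# and different strings yield different ones.
assert words == seed_words_from_string("seed1")
assert words != seed_words_from_string("seed2")
```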

If no seed is provided, a randomly generated one is created (using entropy from the system).
See the [`seed!`](@ref) function for reseeding an already existing `Xoshiro` object.

!!! compat "Julia 1.11"
Passing a negative integer seed requires at least Julia 1.11.

# Examples
```jldoctest
julia> using Random
@@ -191,6 +198,12 @@ endianness and possibly word size.

Using or seeding the RNG of any other task than the one returned by `current_task()`
is undefined behavior: it will work most of the time, and may sometimes fail silently.

When seeding `TaskLocalRNG()` with [`seed!`](@ref), the passed seed, if any,
may be an integer or a vector of `UInt32` integers.

!!! compat "Julia 1.11"
Seeding `TaskLocalRNG()` with a negative integer seed requires at least Julia 1.11.
"""
struct TaskLocalRNG <: AbstractRNG end
TaskLocalRNG(::Nothing) = TaskLocalRNG()
36 changes: 36 additions & 0 deletions stdlib/Random/test/runtests.jl
@@ -926,6 +926,12 @@ end
m = MersenneTwister(0); rand(m, Int64); rand(m)
@test string(m) == "MersenneTwister(0, (0, 2256, 1254, 1, 0, 1))"
@test m == MersenneTwister(0, (0, 2256, 1254, 1, 0, 1))

# negative seeds
Random.seed!(m, -3)
@test string(m) == "MersenneTwister(-3)"
Random.seed!(m, typemin(Int8))
@test string(m) == "MersenneTwister(-128)"
end

@testset "RandomDevice" begin
@@ -1148,3 +1154,33 @@ end
end
end
end

@testset "seed! and make_seed" begin
# Test that:
# 1) if n == m, then make_seed(n) == make_seed(m)
# 2) if n != m, then make_seed(n) != make_seed(m)
rngs = (Xoshiro(0), TaskLocalRNG(), MersenneTwister(0))
seeds = Any[]
for T = Base.BitInteger_types
append!(seeds, rand(T, 8))
push!(seeds, typemin(T), typemin(T) + T(1), typemin(T) + T(2),
typemax(T), typemax(T) - T(1), typemax(T) - T(2))
T <: Signed && push!(seeds, T(0), T(1), T(2), T(-1), T(-2))
end

vseeds = Dict{Vector{UInt32}, BigInt}()
for seed = seeds
bigseed = big(seed)
vseed = Random.make_seed(bigseed)
# test property 1) above
@test Random.make_seed(seed) == vseed
# test property 2) above
@test bigseed == get!(vseeds, vseed, bigseed)
# test that the property 1) is actually inherited by `seed!`
for rng = rngs
rng2 = copy(Random.seed!(rng, seed))
Random.seed!(rng, bigseed)
@test rng == rng2
end
end
end