Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

split random.jl into a new random/ directory and clean-up #22752

Merged
merged 5 commits into from
Aug 17, 2017

Conversation

rfourquet
Copy link
Member

@rfourquet rfourquet commented Jul 11, 2017

"base/random.jl" has slowly grown into a slightly messy file, with which it's sometimes difficult to work because of its size. The reorganization in this PR is mainly to make my life easier (in particular with a refactoring I was planning).

The first commit is the initial split of random.jl into multiple files: it's not fine-grained, as I tried to keep big blocks of code together to help reviewing. Subsequent commits reorganize things at a finer level. I'm open to suggestions for filenames (in particular I'm not conviced with the name of "sequences.jl" containing randsubseq, randperm etc.) and structure.

There should be no functional change, except for the deletion of 2 unexported functions: win32_SystemFunction036!, which was incorporated into rand!(::RandomDevice) (I could not test this change, but appveyor will tell), and globalRNG which I had intended to export in an old PR, but this was decided against because of the prospect of multithreaded RNGs.

EDIT: to clarify, here is the new file list with their dependency (probably incomplete), indicated within square bracket. Also, most files depend on 2), and on 4) for GLOBAL_RNG.
(cf. "EDIT2" below for the current re-organization)

  1. dSFMT.jl: low level wrapping of eponym library
  2. random.jl: defines abstract types and includes other files, and doc of rand/rand!
  3. RNGs.jl [1]: implementation and API of RandomDevice and MersenneTwister
  4. seeding.jl [3]: srand related functions and definition of GLOBAL_RNG (which depends on srand)
  5. numbers.jl [3]: rand for scalar types (rand(::Type{T}))
  6. arrays.jl [3, 5]: arrays of scalars, including specialization for MersenneTwister, plus bitrand
  7. ranges.jl [5, 6]: rand from ranges (rand(1:n))
  8. collections.jl [7]: rand from collections (rand(::Set)))
  9. normal.jl [5]: randn & randexp
  10. uuid.jl [5]: well, uuid
  11. sequences.jl [5, 9]: randperm, randsubseq, etc

[5] and [6] could be merged, similarly to [7] and [8] which contain both scalar and array versions.
Alternative ideas for organization could be (if we want less files): put all MersenneTwister specializations from [4,5,6] in [3], merge what remains from [5,6] together with [7,8] into "generation.jl" or some better name, keep [9] as is, and merge [10,11] together with bitrand and randstring into "misc.jl".

EDIT2: implemented the last plan above (merging some files):

  1. dSFMT.jl
  2. random.jl
  3. RNGs.jl: implementation and API of RandomDevice and MersenneTwister, and specialized random generation, and seeding
  4. generation.jl: all generic algorithms for random generation from AbstractRNG (parts of [5, 6, 7, 8] above)
  5. normal.jl
  6. misc.jl: randstring, bitrand, uuid, randperm etc.

@rfourquet rfourquet added the domain:randomness Random number generation and the Random stdlib label Jul 11, 2017
@rfourquet rfourquet force-pushed the rf/split-random branch 2 times, most recently from bc9567a to 340d263 Compare July 11, 2017 10:32
@rfourquet rfourquet changed the title split random.jl into a new random/ directory split random.jl into a new random/ directory and clean-up Jul 11, 2017
@rfourquet
Copy link
Member Author

I would like to request review on this one, as it will be difficult to rebase when other changes get merged, and I would like know where to start further work on rand: on top of this, or on top of master if this PR is not approved.

@fredrikekre
Copy link
Member

Would it not be possible to clean up in sections within the same file? Now this is split into ~10 files with ~150 lines of code each, which is not very typical for base, is it? Look at e.g. the HigherOrderFns module for a quite long file, which still is easy to follow with the descriptive header up top and the rest of the file is clearly separated into sections. Just a thought... 🇸🇪

@rfourquet
Copy link
Member Author

Other functionalities are split into directories rooted in "base/", with more or less big files, a good example with small files is "base/markdown/". Well, splitting is not strictly necessary, but I can't count the number of times I had to scroll inefficiently through "random.jl", to find relevant part of the code; admitedly I'm not good at code browsing, but this splitting would make my life easier, so why not! (of course if there aren't drawbacks I didn't foresee).

@andreasnoack
Copy link
Member

I think this is for the better. You can always argue where the line should be drawn but since you the person who is most familiar with this code, I'd go with your proposal.

@yuyichao
Copy link
Contributor

I can't count the number of times I had to scroll inefficiently through "random.jl", to find relevant part of the code

And now for anyone who didn't write this code, they need to scroll through multiple files instead of searching one file.

@rfourquet
Copy link
Member Author

@yuyichao I don't understand your comment. Having a code base split in mutiple files has some advantages. I'm simply proposing to draw the line closer to having smaller files. By saying "make my life easier", I acknowledge my selfish motivation: as I work for free, I prefer to optimize my time. But by no means I was implying "and make the life of others more difficult". I think it makes it easier for anyone who works on a rand related feature.

@yuyichao
Copy link
Contributor

Having a code base split in mutiple files has some advantages.

Those advantages are mainly for the case where each file are relatively self contained and have limited and relatively well defined interface (i.e. the functions calls and name references between files). When you have strong coupling between different part, it's not necessarily a good idea to split them. With our codebase as an example, I've also personally spent a lot of time figuring out if a codegen helper function is defined in codegen.cpp or cgutils.cpp exactly because there's strong coupling between the two files (OTOH, ccall.cpp, intrinsics.cpp doesn't really have this problem).

And that is exactly what I mean is that this does make it much harder for others to browse the code. I've personally searched through this file checking the calling between a few internal functions but now they are usually not used in the same file they are defined which makes them much harder to read.

The splitting is also not super clear for readers that didn't write all the code. As an example, there are functions that deal with Array or AbstractArray in all of random/arrays.jl, random/collections.jl, random/sequences.jl, random/ranges.jl and just from the API it's not very clear why they are where they are. This doesn't help finding the

@rfourquet
Copy link
Member Author

OK thanks for the clarification. I updated the OP to include the logic of file organization, including a proposition of a simpler scheme (i.e. less files). As I said, suggestions are welcome.

I experienced also not understanding well the organisation of some files in base and feeling it's not optimal. But here I precisely tried to clarify the dependency of functionalities and reduce "coupling" (if I understand correctly the word "coupling"), there should be no coupling between different files I created, only on-way dependencies.

they are usually not used in the same file they are defined which makes them much harder to read.

For example, some functions are defined in dSFMT.jl, but only used in random.jl: the first file defines the low-level API, the second one uses this low-level to define a higher-level API. This looks like trivial code organization. By your argument they should be merged? I would say this is subjective. Here I split this higher-level into a medium-level MersenneTwister API (in RNGs.jl), and a user-level API in numbers.jl and arrays.jl, but I have no problem putting those back in the same file (i.e. RNGs.jl).

@rfourquet
Copy link
Member Author

Rebased, add re-organized the functionalities into fewer file, cf. precisions at the bottom of OP. I'm pretty happy with the current state, I plan to merge within a couple of days if no objection.


rand(rng::RandomDevice, ::Type{CloseOpen}) = rand(rng, Close1Open2) - 1.0

@inline rand(r::RandomDevice, T::BitFloatType) = rand_generic(r, T)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did this exist before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is one of the very few "functional" change; I introduced it to be DRY and to satisfy the last code re-organization: this method was previously defined for Union{RandomDevice,MersenneTwister}, but now I put the code concerning specific RNGs in their respective section. Also, eventually the plan is to allow users to define easily rand methods in terms of others, so this is one small first step in this direction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be preferable to do the code rearrangement and functional changes separately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I thought it should be harmless (non-breaking), but will amend. Other changes are: removing of the win32_SystemFunction036! (incorporated in rand!(::MersenneTwister)), and removing of the never-used never-exported globalRNG() function: do you want me to also split that out in a separate PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those changes only appear in a single place each, right? was easier to see what was going on with those

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Is it fine for you if I split the introduction of rand_generic in a separate commit but in the same PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if it were the last commit instead of partway through, since some of the last commits reorganized things that the first few commits changed so the total lines of code changed commit by commit are quite a bit more than the overall lines of code changed when viewing them combined

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rand_generic was introduced in the last commit, so I could easily split it out in a new last commit. I can also easily merge the current last 2 commits ("merge/rename some files). Would it be good enough?

@ViralBShah
Copy link
Member

I really like this.

@yuyichao
Copy link
Contributor

I updated the OP to include the logic of file organization, including a proposition of a simpler scheme (i.e. less files). As I said, suggestions are welcome.

FWIW, that still doesn't change the fact that you need to go through each file to find where a function/method is defined.

For example, some functions are defined in dSFMT.jl, but only used in random.jl: the first file defines the low-level API, the second one uses this low-level to define a higher-level API. This looks like trivial code organization. By your argument they should be merged?

No dSFMT.jl is totally different. It's a completely different module and the most functions has dsfmt in it's name. export and using makes it slightly harder to track for the few functions/variables that doesn't have dsfmt in the name but that's relatively minor.

@rfourquet
Copy link
Member Author

FWIW, that still doesn't change the fact that you need to go through each file to find where a function/method is defined.

Sorry, I still don't understand you. Base is full of cases where different methods of the same function are defined in different files, this is a necessary trade-off. Here, with the current state of the PR, almost only the rand[!] function is defined in two different files, RNGs.jl for the 2 RNGs, and generic.jl for AbstractRNG. IMHO it's ok.

No dSFMT.jl is totally different

I gave this example because you complained by saying:

[a few internal functions] are usually not used in the same file they are defined which makes them much harder to read.

That's exactly what happens with dSFMT.jl, which defines functions which are used in the random.jl (in master), so your "totally different" is exaggerated.

##### GLOBAL_RNG

function dsfmt_gv_srand()
# Temporary fix for #8874 and #9124: update global RNG for Rmath
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

guess this comment probably isn't relevant any more (but can be dealt with another time)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cf. #19980, which removes this fix, but whose merge preconditions are not satisfied.

rand(r::AbstractRNG, T::Type, dims::Dims) = rand!(r, Array{T}(dims))
rand( T::Type, dims::Dims) = rand(GLOBAL_RNG, T, dims)
rand(r::AbstractRNG, T::Type, d1::Integer, dims::Integer...) = rand(r, T, Dims((d1, dims...)))
rand( T::Type, d1::Integer, dims::Integer...) = rand(T, Dims((d1, dims...)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully equivalent, but any reason in particular for changing this from the previous rand(T::Type, d1::Integer, dims::Integer...) = rand(T, tuple(Int(d1), convert(Dims, dims)...)) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make it fit on one line, although it's still 94 chars for the longest line here (I considered renaming d1 to d to get 92 chars...).

#### random integers

@inline rand(r::MersenneTwister,
::Type{T}) where {T<:Union{Bool,Int8,UInt8,Int16,UInt16,Int32,UInt32}} =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this formatting looks pretty strange - isn't there a shorthand name for this union somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a shorthand for this one. For the formatting, it's standard to align vertically arguments names, but I agree the where there looks strange.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right I thought this was BitIntegerSmall but looks like that doesn't include (U)Int32 on 32 bit, or Bool. I'd maybe do something like

@inline (rand(r::MersenneTwister, ::Type{T}) where
    {T<:Union{Bool,Int8,UInt8,Int16,UInt16,Int32,UInt32}}) = rand_ui52_raw(r) % T

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly prefer the way it is now than this alternative...

@yuyichao
Copy link
Contributor

yuyichao commented Jul 18, 2017

Base is full of cases where different methods of the same function are defined in different files, this is a necessary trade-off.

There are many differences between the two cases. For different files in Base, they are generally seperated by namespaces or the types they are operating on. Coupling do exist but are generally done through standard interfaces (convert, indexing etc). This makes it easy to guess where a function is defined in based on their type/name/module most of the time without having to open the file they are defined in and read the description in that file. It also means that the functions defined in other files/modules don't generally affect reading the code since they are usually not the most important part/are just standard exported API with well understood behavior. None of these are the case for random.jl. The rules that separate these files are not random but also not obvious, at least not for anyone who didn't write this and didn't remember the organization by heart, and it is also generally necessary to go across files in order to understand the call chain (unlike different files/submodules in base)

[a few internal functions] are usually not used in the same file they are defined which makes them much harder to read.

That's exactly what happens with dSFMT.jl, which defines functions which are used in the random.jl (in master), so your "totally different" is exaggerated.

Not really, they are internal base functions but are basically dSFMT.jl API and even though they are not public API, they are well namespaced which makes a huge difference for "searchability" and the much harder to read is the point. The fact that dsfmt_gv_srand is defined outside it does mess it up a tiny bit though it's pretty close to the user and, more importantly, in the safe file as the user so it doesn't really affect things much.

@rfourquet
Copy link
Member Author

This makes it easy to guess where a function is defined in based on their type/name/module most of the time

Well, some amount of work is necessary for understanding the organization of many functions/files in base; comparatively, the change here doesn't require a particular amount of work from a casual reader IMO (depend of course on what the reader's task is). And for someone familiar with the random code base, I think this change improves things (a lot, as far as I'm concerned).

Anyway, if you see specific coupling between those new files, please report them. AFAICT, there are none (i.e. no file depends on the "internal workings" (expression taken from wikipedia on coupling) of another file).

@yuyichao
Copy link
Contributor

Well, some amount of work is necessary for understanding the organization of many functions/files in base

Yes but irrelevant. No or minimum amount of work should be required to navigate through the files.

comparatively, the change here doesn't require a particular amount of work from a casual reader IMO

And as someone who follows the function calls in this file from time to time, I can also say that being able to find all functions (that doesn't obviously belong to a different namespace) in the same file helps a lot.

If you want a good example of what's a splitting of files that doesn't require any effort to guess where the function is defined, you can have a look at pkg/. All files are in a module and the namespace makes it extremely clear where I can find the function definition. It takes almost no extra effort to jump between different files even though there are a dozen of them.

Anyway, if you see specific coupling between those new files, please report them. AFAICT, there are none (i.e. no file depends on the "internal workings" (expression taken from wikipedia on coupling) of another file).

Ignore the word "coupling" if you want. What I mean is "calling internal functions that are not clearly namespaced or separated by input types from a different file". This is (the other) half of the problem and examples includes,

  • rand_ui52
  • rand_ui52_raw
  • rand_ui10_raw
  • rand_ui23_raw

@rfourquet
Copy link
Member Author

Yes but irrelevant. No or minimum amount of work should be required to navigate through the files.

I was giving context, so it's certainly not irrelevant. Please show a bit more courtesy. I don't disagree with your ideal, and I tried to improve things on this front.

I can also say that being able to find all functions (that doesn't obviously belong to a different namespace) in the same file helps a lot.

I agree, for some tasks; but it doesn't mean having everything in the same file (in random.jl in this case) is better overall.

calling internal functions that are not clearly namespaced or separated by input types from a different file

Regarding rand_ui52 etc: imagine they don't exist yet, and the files in random are split as in this PR. Now I want to introduce those functions to improve performance: would you really ask me to then merge all the files which use those functions, because it's otherwise too difficult to follow where/how they are used? I'm open to naming them differently, but they are already kind of namespaced (they start with rand_ui); plus, in their definition, they are separated by input types (if I understood what you mean).

We fail to understand each other, and I start to feel uneasy with this discussion as I don't see it as so useful in this public forum. I will leave to others discussing with you the merits/disadvantages of this approach.

rand(rng, RangeGenerator(r))

rand!(rng::AbstractRNG, A::AbstractArray,
r::UnitRange{<:Union{Signed,Unsigned,BigInt,Bool,Char}}) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does rand! support UnitRange{Char} but rand doesn't?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, UnitRange{Char} cannot be constructed so this Char can be removed from this Union, which was on my todo list already.
This is a leftover from an old commit following Char not being an Integer anymore (cf. #13600).

yy = -log(rand(rng))
yy+yy > xx*xx && return (rabs >> 8) % Bool ?
-ziggurat_nor_r - xx :
ziggurat_nor_r + xx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would probably be clearer written out longhand as

            if yy+yy > xx*xx
                if (rabs >> 8) % Bool
                    return -ziggurat_nor_r - xx
                else
                    return ziggurat_nor_r  + xx
                end
            end

@yuyichao
Copy link
Contributor

would you really ask me to then merge all the files which use those functions, because it's otherwise too difficult to follow where/how they are used?

Very good argument for not doing it in the first place. There isn't a super clear cut between the files and yes, when functions like these are added, it's an evidence that the file should be merged or organized differently.

(they start with rand_ui)

So yes, for this particular case (which is still only the internal function part of the issue) I'll be perfectly fine with it if all of them are defined in a file named ui.jl or rand_ui.jl.

@StefanKarpinski
Copy link
Sponsor Member

There's a net increase of 127 lines. Is that from headers and wrapping or does this add any code?

@rfourquet
Copy link
Member Author

rfourquet commented Jul 25, 2017

If I count non-blank/non-comment lines, then there is a net increase of only 26 lines. In the commit wrapping lines at 92 chars shows a net increase of 64 lines (including few comments), few of which have been reverted subsequently (by compressing a couple of functions). So without counting in detail, this confirms that basically no code has been added.

@tkelman
Copy link
Contributor

tkelman commented Jul 25, 2017

I'd squash together any commit that partially undoes something or makes significant revisions to the same file that was touched in a previous commit. Breaking the commits apart into contiguous sections of code that are moved unmodified vs style revisions would also make sense.

@ViralBShah
Copy link
Member

FWIW, I have always wanted to move these away into a directory and some reorganization.

@StefanKarpinski
Copy link
Sponsor Member

I'm all for this reorganization and don't understand what all the fuss is about. Sure, we can leave all files as-is forever since the language doesn't impose a particular file structure. But occasionally you want to restructure things better and more logically – and who better to reorganize the RNG code than @rfourquet? Please carry on! Group the commits logically as you see fit and merge when ready.

@Sacha0
Copy link
Member

Sacha0 commented Jul 25, 2017

Look at e.g. the HigherOrderFns module for a quite long file, which still is easy to follow with the descriptive header up top and the rest of the file is clearly separated into sections.

In support of thoughtful organization into multiple files, judiciously splitting SparseArrays.HigherOrderFns into a few files would almost certainly be better than the present organization :). Best!

Edit:

who better to reorganize the RNG code than @rfourquet? Please carry on!

💯

@rfourquet
Copy link
Member Author

@nanosoldier runbenchmarks("random", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@rfourquet
Copy link
Member Author

@nanosoldier runbenchmarks("random", vs = ":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan


function DSFMT_state(val::Vector{Int32} = zeros(Int32, JN32))
length(val) == JN32 ||
throw(DomainError(length(val), "Expected length: $JN32."))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're just moving this, but seems odd that this isn't a DimensionMismatch

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, will put this on my todo list.

Fill the array `A` with random numbers following the exponential distribution
(with scale 1).

# Example
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you seem to have lost the formatting change to

# Examples
```jldoctest

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oups sorry. I don't know if this come from the rebase... will re-check everything carefully. Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I rebased again by redoing all the changes manually commit by commit to double-check. I think this should be good to go now.

julia> x1 = rand(rng, 2)
2-element Array{Float64,1}:
0.590845
0.766797

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline between # Examples and jldoctest shouldn't be added back here

julia> x1 = rand(2)
2-element Array{Float64,1}:
0.590845
0.766797

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newline between # Examples and jldoctest shouldn't be added back here

julia> rand(Int, 2)
2-element Array{Int64,1}:
1339893410598768192
1575814717733606317

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or here

@rfourquet
Copy link
Member Author

The one CI failure is a timeout (travis).

@rfourquet rfourquet merged commit b12c778 into master Aug 17, 2017
@rfourquet rfourquet deleted the rf/split-random branch August 17, 2017 13:15
rfourquet added a commit that referenced this pull request Sep 13, 2020
This function doesn't exist since commit 72b5b80 (#22752).
Fix #37541.
rfourquet added a commit that referenced this pull request Sep 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:randomness Random number generation and the Random stdlib
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants