
Reproduce a given transform #208

Closed
romainVala opened this issue Jun 26, 2020 · 55 comments · Fixed by #353

Labels
enhancement New feature or request

Comments

@romainVala
Contributor

🚀 Feature
Given the history information saved in the sample for a given transform, how can I apply the exact same transform to a new volume?

Motivation
After doing a lot of data augmentation for training and validation, I would like to visually check some specific volumes (the one with the worst loss, for instance).
Since saving all the transformed volumes can become very costly (in terms of disk space), one prefers to save the transform's parameters and have the possibility to re-create the exact same transformed volume.

Pitch
I implemented this once for RandomElasticDeformation:

    def apply_given_transform(self, sample: Subject, bspline_params) -> Subject:
        for image_dict in sample.get_images(intensity_only=False):
            # Label maps must be interpolated with nearest neighbors
            if image_dict[TYPE] == LABEL:
                interpolation = Interpolation.NEAREST
            else:
                interpolation = self.interpolation
            image_dict[DATA] = self.apply_bspline_transform(
                image_dict[DATA],
                image_dict[AFFINE],
                bspline_params,
                interpolation,
            )
        return sample

Alternatives
Maybe a more generic way would be to separate the apply_transform code into two distinct parts:
one to get the random parameters,
one to apply them,
so that the applying code could be reused in both cases.
...

Additional context

@fepegar
Owner

fepegar commented Jun 26, 2020

I agree this would be useful, and that what you suggest makes sense. But maybe a way to do this without changing so much code everywhere is to get the current seed of the PyTorch random number generator and add it to the parameters dictionary, or rather to use a torch.Generator and put torch.Generator.initial_seed() in the dictionary. I'll need to investigate this.
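A minimal sketch of that idea, assuming the stored value is a seed small enough for manual_seed on any PyTorch version (the helper name is made up; this is not the TorchIO implementation):

import torch

def sample_with_generator(seed=None):
    if seed is None:
        # draw a seed in a range that manual_seed accepts on any PyTorch version
        seed = int(torch.randint(0, 2**31 - 1, (1,)).item())
    generator = torch.Generator()
    generator.manual_seed(seed)  # a local generator leaves the global RNG state untouched
    value = torch.rand(1, generator=generator).item()
    return value, seed

value, seed = sample_with_generator()
same_value, _ = sample_with_generator(seed)
assert value == same_value  # the stored seed reproduces the draw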

Useful links:

@fepegar
Owner

fepegar commented Jun 26, 2020

I'll keep adding some findings here, for reference.

$ ipython
Python 3.8.2 (default, Mar 26 2020, 10:43:30)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: torch.save(torch.get_rng_state(), '/tmp/rng.pth')

In [3]: torch.rand(1)
Out[3]: tensor([0.8218])

In [4]:
Do you really want to exit ([y]/n)?
$ ipython
Python 3.8.2 (default, Mar 26 2020, 10:43:30)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: torch.set_rng_state(torch.load('/tmp/rng.pth'))

In [3]: torch.rand(1)
Out[3]: tensor([0.8218])

@fepegar
Owner

fepegar commented Jun 26, 2020

I think this would do:

In [2]: %run /tmp/romain.py

In [3]: do_random_stuff()
Out[3]: (tensor([0.6550]), 6129670197736294978)

In [4]: do_random_stuff(6129670197736294978)
Out[4]: (tensor([0.6550]), 6129670197736294978)

Where /tmp/romain.py is

import torch

def do_random_stuff(seed=None):
    if seed is None:
        seed = torch.seed()      # seed the RNG non-deterministically and return the seed used
    else:
        torch.manual_seed(seed)  # reproduce a previous call

    return torch.rand(1), seed

@fepegar
Owner

fepegar commented Jun 26, 2020

So you could access the seed used for a specific transformation from the transform history, and use it as an argument to reproduce the exact result. Would that work for you? If yes, I'll open a PR soon.

@romainVala
Contributor Author

This sounds like a nice lazy solution!
Sounds good to me, if it does the job.

I guess one should be careful about where to put the get_rng_state and the set (in the transform init ... ?).

Maybe something to check is that only torch is used to get random numbers (and not numpy.random); it seems to be the case for your transforms (but not the ones I added, but well, I'll change them).

Thanks

@fepegar
Owner

fepegar commented Jun 26, 2020

get_rng_state is not needed, just torch.seed().

I strongly suggest you do change your random generators to PyTorch in your implementation. I've had trouble before with multiprocessing. Using PyTorch to generate random numbers ensures good results. I think it's related to pytorch/pytorch#5059.
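For example, a numpy-based draw could be swapped for a torch-based one along these lines (a rough sketch; the helper names and ranges are placeholders):

import torch

# instead of np.random.uniform(low, high):
def torch_uniform(low, high):
    return low + (high - low) * torch.rand(1).item()

# instead of np.random.normal(mean, std):
def torch_normal(mean, std):
    return mean + std * torch.randn(1).item()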

@romainVala
Contributor Author

get_rng_state is not needed, just torch.seed().

Why's that? For each transform you need to get the specific random state to save in the history ... no? (And not the seed from the beginning of the code.)

Just to be sure I am clear enough:
I want to apply 1000 transforms (with or without a manual seed at the beginning), so each one has different parameters,

and from the saved history of a given transform (let's say transform 600) I want to retrieve the same random parameters.

@fepegar
Owner

fepegar commented Jun 26, 2020

Maybe this script can help me explain what I mean:

import torch


class RandomMultiply:
    def __init__(self, seed=None):
        self.seed = seed

    def get_params(self):
        if self.seed is None:
            seed = torch.seed()      # non-deterministic seed; the value actually used is returned
        else:
            seed = self.seed
            torch.manual_seed(seed)  # reproduce a previous call
        random_number = torch.rand(1).item()
        return seed, random_number

    def __call__(self, x):
        seed, random_number = self.get_params()
        x.n = random_number
        random_params = {
            'seed': seed,
            'random_number': random_number,
        }
        x.history.append(random_params)
        return x

class Number:
    def __init__(self):
        self.n = 100
        self.history = []


def main():
    number = Number()
    num_iterations = 1000
    transform = RandomMultiply()
    results = []
    for _ in range(num_iterations):
        number = transform(number)
        results.append(number.n)

    print('I love this result:', results[600])
    random_params = number.history[600]
    print('The number was {random_number} and the seed was {seed}'.format(**random_params))

    new_number = Number()
    reproducer = RandomMultiply(seed=random_params['seed'])
    reproduction = reproducer(new_number)
    print('\nReproduced result:', reproduction.n)
    print('Random parameters:', reproduction.history[0])



if __name__ == "__main__":
    main()

Output:

I love this result: 0.36917614936828613
The number was 0.36917614936828613 and the seed was 9140603397059799205

Reproduced result: 0.36917614936828613
Random parameters: {'seed': 9140603397059799205, 'random_number': 0.36917614936828613}

However, sometimes I get an error when I run the script:

I love this result: 0.738286554813385
The number was 0.738286554813385 and the seed was 17402630908179140122
Traceback (most recent call last):
  File "/tmp/romain.py", line 55, in <module>
    main()
  File "/tmp/romain.py", line 48, in main
    reproduction = reproducer(new_number)
  File "/tmp/romain.py", line 18, in __call__
    seed, random_number = self.get_params()
  File "/tmp/romain.py", line 13, in get_params
    torch.manual_seed(seed)
  File "/usr/local/Caskroom/miniconda/base/envs/torchio/lib/python3.8/site-packages/torch/random.py", line 34, in manual_seed
    return default_generator.manual_seed(seed)
RuntimeError: Overflow when unpacking long

@fepegar
Owner

fepegar commented Jun 26, 2020

It works if I use torch.manual_seed(seed & (2**63 - 1)). Reference: https://discuss.pytorch.org/t/initial-seed-too-large/28832/2
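For example, the failing seed from the traceback above fits once it is masked:

import torch

seed = 17402630908179140122           # larger than 2**63 - 1, so manual_seed overflows
torch.manual_seed(seed & (2**63 - 1))  # masking keeps it within a signed 64-bit integer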

@romainVala
Contributor Author

I love this result too!
Thanks very much for the example, it is convincing.

It is still not 100% clear to me what torch.seed is doing ... is it equivalent to a torch.manual_seed call but with a non-deterministic random number? (So in the end it keeps the randomness?)

The thing I do not like about the use of torch.manual_seed(seed) is that it influences all the code using random numbers that follows, but for my use case I guess I can live with that.

@fepegar
Owner

fepegar commented Jun 26, 2020

It is still not 100% clear to me what torch.seed is doing ... is it equivalent to a torch.manual_seed call but with a non-deterministic random number? (So in the end it keeps the randomness?)

I think so, yes. It probably calls a very low-level RNG and sets the seed with that.

The thing I do not like about the use of torch.manual_seed(seed) is that it influences all the code using random numbers that follows, but for my use case I guess I can live with that.

I guess the ideal implementation would include resetting the random state after getting the random parameters, but hopefully that's not a problem.
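A sketch of that idea, saving and restoring the global RNG state around the parameter sampling (hypothetical helper, not the TorchIO code):

import torch

def get_params_without_side_effects(seed):
    saved_state = torch.get_rng_state()     # remember the global RNG state
    torch.manual_seed(seed & (2**63 - 1))
    param = torch.rand(1).item()            # stand-in for the real parameter sampling
    torch.set_rng_state(saved_state)        # restore it so the following code is unaffected
    return param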

@romainVala
Contributor Author

But you will break reproducibility of the code if one wants to compare results obtained before and after the changes ... I guess it is OK, one just needs to know.

@romainVala
Contributor Author

I guess the ideal implementation would include resetting the random state after getting the random parameters, but hopefully that's not a problem.

Why not use torch.get_rng_state()? (It won't modify the random state ...)

@fepegar
Owner

fepegar commented Jun 26, 2020

Yes but

  1. You would need to save 5056 bytes at each iteration, instead of just 8 (see the quick check below)
  2. How do you implement it? Would you have to pass the random state to the transform init?
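The size difference can be checked directly (the exact byte count may vary across PyTorch versions):

import torch

state = torch.get_rng_state()                 # a uint8 tensor holding the full RNG state
print(state.numel() * state.element_size())   # roughly the 5056 bytes mentioned above
print(torch.tensor(0, dtype=torch.int64).element_size())  # a seed is a single 8-byte integer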

@fepegar
Owner

fepegar commented Jun 26, 2020

Maybe we can see how other libraries using PyTorch deal with this, if they do.

@romainVala
Contributor Author

  1. Or this one: https://pytorch.org/docs/master/generated/torch.Generator.html#torch.Generator
    Maybe the size is also big ...

  2. You get it and set it (if not None) in get_params, like you did ...

  3. With either choice, if transforms make several calls to get_params, this becomes heavy ...

@romainVala
Contributor Author

Maybe we can see how other libraries using PyTorch deal with this, if they do.

Yes, sure ... no hurry. (And my first option is maybe not so much code ... mostly copy-paste.) It also has the advantage of being explicit and more independent of the deep PyTorch seeding ...

@fepegar
Owner

fepegar commented Jul 1, 2020

For reference: if I add a manual_seed at the top of the script, I'd expect to always get the same result, but that wasn't happening. Using the torch.random.fork_rng() context manager when calling get_params helped to reproduce a number at the end of the script, but not the transforms:

import torch
torch.manual_seed(25074390)

class RandomMultiply:
    def __init__(self, seed=None):
        self.seed = seed

    def get_params(self):
        if self.seed is None:
            seed = torch.seed()
        else:
            seed = self.seed
            torch.manual_seed(seed & (2**63 - 1))
        random_number = torch.rand(1).item()
        return seed, random_number

    def __call__(self, x):
        with torch.random.fork_rng():  # the global RNG state is restored on exit
            seed, random_number = self.get_params()
        x.n = random_number
        random_params = {
            'seed': seed,
            'random_number': random_number,
        }
        x.history.append(random_params)
        return x

class Number:
    def __init__(self):
        self.n = 100
        self.history = []


def main():
    number = Number()
    num_iterations = 1000
    transform = RandomMultiply()
    results = []
    for _ in range(num_iterations):
        number = transform(number)
        results.append(number.n)

    print('I love this result:', results[600])
    random_params = number.history[600]
    print('The number was {random_number} and the seed was {seed}'.format(**random_params))

    new_number = Number()
    reproducer = RandomMultiply(seed=random_params['seed'])
    reproduction = reproducer(new_number)
    print('\nReproduced result:', reproduction.n)
    print('Random parameters:', reproduction.history[0])

    print('A reproducible random number:', torch.rand(1).item())


if __name__ == "__main__":
    main()

Output:

$ python romain.py
I love this result: 0.9413431882858276
The number was 0.9413431882858276 and the seed was 11325910110281226277

Reproduced result: 0.9413431882858276
Random parameters: {'seed': 11325910110281226277, 'random_number': 0.9413431882858276}
A reproducible random number: 0.1055898666381836
$ python romain.py
I love this result: 0.4556175470352173
The number was 0.4556175470352173 and the seed was 11295961902343433785

Reproduced result: 0.4556175470352173
Random parameters: {'seed': 11295961902343433785, 'random_number': 0.4556175470352173}
A reproducible random number: 0.1055898666381836

I'll try with generators.

@fepegar
Owner

fepegar commented Jul 1, 2020

New test using generators. I'm probably doing something wrong. The same seed generates different results. Each run had a different manual seed at the top of the script.

    def get_params(self):
        generator = torch.Generator()
        if self.seed is None:
            seed = generator.initial_seed()
        else:
            seed = self.seed
            generator.manual_seed(seed)
        # Note: the generator is never passed to torch.rand here, so the global RNG is used
        random_number = torch.rand(1).item()
        return seed, random_number
$ python romain.py
I love this result: 0.1055898666381836
The number was 0.1055898666381836 and the seed was 67280421310721

Reproduced result: 0.1055898666381836
Random parameters: {'seed': 67280421310721, 'random_number': 0.1055898666381836}
A reproducible random number: 0.1055898666381836


$ python romain.py
I love this result: 0.9377833008766174
The number was 0.9377833008766174 and the seed was 67280421310721

Reproduced result: 0.9377833008766174
Random parameters: {'seed': 67280421310721, 'random_number': 0.9377833008766174}  <---- same seed as above but yields different number
A reproducible random number: 0.9377833008766174

@romainVala
Contributor Author

I do not see the problem ...
The reproduced result is working fine.

The fact that you obtain different results when you run the code twice is what is expected (with random numbers), no?
If you want to reproduce, you need to add a manual seed at the beginning of your script.

What do I miss?

@fepegar
Owner

fepegar commented Jul 1, 2020

I do not see the problem ...
The reproduced result is working fine.

Note that I changed the seed at the top for each run.

What I want is that the result of a transform is deterministic given a certain input and seed, independently of the manual seed I set at the beginning. BUT if I do set a manual seed at the beginning of the script, I want exactly the same thing to happen every time I run it.

@romainVala
Contributor Author

OK, I see, I missed the second line of your example ...
I tried to test it here but I get an error:
File "test.py", line 10, in get_params
seed = torch.seed()
File "/opt/anaconda3/lib/python3.7/site-packages/torch/random.py", line 45, in seed
torch.cuda.manual_seed_all(seed)
File "/opt/anaconda3/lib/python3.7/site-packages/torch/cuda/random.py", line 111, in manual_seed_all
_lazy_call(cb)
File "/opt/anaconda3/lib/python3.7/site-packages/torch/cuda/init.py", line 146, in _lazy_call
callable()
File "/opt/anaconda3/lib/python3.7/site-packages/torch/cuda/random.py", line 109, in cb
default_generator.manual_seed(seed)
RuntimeError: Overflow when unpacking long

I may not have the latest Python nor the latest PyTorch.

@fepegar
Owner

fepegar commented Jul 1, 2020

Try this #208 (comment)

@romainVala
Contributor Author

It is already in your example ...
My error is on line 10:
seed = torch.seed()

@fepegar
Owner

fepegar commented Jul 1, 2020

Oh, sorry!

@fepegar
Owner

fepegar commented Jul 1, 2020

@GReguig
Contributor

GReguig commented Jul 6, 2020

Hello,

If I may add a little piece of information that I remember from my computer science lectures: a pseudo-random number generator computes a sequence from the given seed. Therefore, to be able to get the same result, we not only need to set the same seed, but we also need to call the random functions as many times as they were called the first time.
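A minimal illustration of that point: with the same seed, the value you get also depends on how many draws happened before.

import torch

torch.manual_seed(42)
_ = torch.rand(1)   # one extra draw before the one we care about
a = torch.rand(1)

torch.manual_seed(42)
b = torch.rand(1)   # same seed, but no extra draw first

print(torch.equal(a, b))  # False: the seed alone is not enough, the call count matters too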

@GReguig
Contributor

GReguig commented Jul 7, 2020

Hello again,

I started from your code @fepegar but used the rng_state instead of the seed in order to reproduce the results (I think the rng_state contains the full information of the generator). This worked. To get the result, I used torch.get_rng_state() and torch.set_rng_state() instead of the seed.

import torch
torch.manual_seed(25074390)

class RandomMultiply:
    def __init__(self, rng_state=torch.get_rng_state()):
        # Note: the default value is evaluated once, when the class is defined
        self.rng_state = rng_state
        torch.set_rng_state(rng_state)

    def get_params(self):
        """
        if self.seed is None:
            seed = torch.seed()
        else:
            seed = self.seed
            torch.manual_seed(seed & (2**63 - 1))
        """
        rng_state = torch.get_rng_state()
        random_number = torch.rand(1).item()
        return rng_state, random_number

    def __call__(self, x):
        #with torch.random.fork_rng():
        rng_state, random_number = self.get_params()
        x.n = random_number
        random_params = {
            'rng_state': rng_state,
            'random_number': random_number,
        }
        x.history.append(random_params)
        return x

class Number:
    def __init__(self):
        self.n = 100
        self.history = []


def main():
    number = Number()
    num_iterations = 1000
    transform = RandomMultiply()
    results = []
    for _ in range(num_iterations):
        number = transform(number)
        results.append(number.n)

    print('I love this result:', results[600])
    random_params = number.history[600]
    print("The number was {random_number} and the rng_state was {rng_state}".format(**random_params))

    new_number = Number()
    reproducer = RandomMultiply(rng_state=random_params["rng_state"])
    reproduction = reproducer(new_number)
    print('\nReproduced result:', reproduction.n)
    print('Random parameters:', reproduction.history[0])
    torch.set_rng_state(reproduction.history[0]["rng_state"])
    print('A reproducible random number:', torch.rand(1).item())


if __name__ == "__main__":
    main()

@fepegar
Owner

fepegar commented Jul 7, 2020

Hi @GReguig,

Thanks for investigating this with us. We briefly discussed passing the random state in #208 (comment), but I'd like to use the seed if possible, so that the API is not changed and because it's easier and more intuitive for users. However, if that solution satisfies these conditions and we can't think of anything else, I guess we'll need to follow that direction.

P.S.: you can specify the language for syntax highlighting in Markdown, i.e. open the code fence with ```python instead of a bare ```:

print(1 if b is None else 'hey')   # plain fence: no highlighting

vs

print(1 if b is None else 'hey')   # python fence: highlighted

@GReguig
Contributor

GReguig commented Jul 7, 2020

Oh alright, sorry about that. I will investigate this a little bit more then.

Thanks

@romainVala
Contributor Author

romainVala commented Jul 9, 2020

Well, I still think that my first proposition is a better choice.

Going through seeding as you propose seems very risky to me:
_ will it work in 10 years, with a totally new Python version?

It will if people use the same version of the libraries they used for the experiment. Which they should be logging. You could say this about every software.

_ manually setting the seed may have some uncontrolled effects. For instance, imagine a for loop with a composition of transforms, where you want to fix the first transform and keep the second one random (there is a risk that the second one is not random)

If we managed to set the seed only for the current transform, that would not happen.

_ saving a long seeding parameter for each transform will double the saved history

"Double" feels arbitrary, and for e.g. the coarse field in RandomElasticDeformation it would be smaller to save the RNG state than the parameters.

On the other hand, having the transform reproduced from only the parameters saved in the history is much more elegant (as it is explicit and can be exported in text form, so it is stable over time).

It is indeed nice to see what the parameters were in the log.

In terms of code refactoring, it may be a good occasion to improve the code.
The way I see it:
we should remove the get_params calls (and their saving to the history) from each transform and instead call a generic (unique) function that takes as input a dict with the parameter name (useful for saving to the history), a range, and a type (vector dimension, int, float ...). This is maybe the trickier part, but it should be possible.

I'm not sure I follow here.

By the way, some simple transforms like scaling do not save any parameters in the history, which makes it impossible to know a posteriori that they have been applied. From a reproducibility perspective, all transforms should save a history entry.

I thought they were not necessary because the user is meant to know what transforms they applied. But I agree it'd be useful if you apply e.g. HistogramStandardization with a probability < 1.

So to summarize: random or user-defined parameter setting, saving to the history, and reading (from CSV?) should be handled in the abstract transform class.

Ok, I agree with most of these. Still not sure about the implementation. I do like the get_params method, it's also used by transforms in torchvision.transforms. Maybe it could be implemented so that the method returns args and kwargs that would make the transform generate the same result.

Ghiles could help with the PR if you are interested.

It'll definitely be great to have some help once we know what we're after.

@romainVala
Contributor Author

romainVala commented Jul 9, 2020

I thought they were not necessary because the user is meant to know what transforms they applied. But I agree it'd be useful if you apply e.g. HistogramStandardization with a probability < 1.

After 10 different experiments, I personally do not remember them all ...
But even if I did remember, the objective is to reproduce a given specific composition of transforms (so I need all of them in the history).

My idea was that each transform would call the same get_params function (instead of what is done today, defining it in each transform class),
but I am not that sure, so I let you decide.

let us know how we can help

@fepegar
Owner

fepegar commented Jul 9, 2020

After 10 different experiments, I personally do not remember them all ...
But even if I did remember, the objective is to reproduce a given specific composition of transforms (so I need all of them in the history).

Ideally you would log the status of your code (last commit), or even the code itself, in the experiment log. I use TensorBoard and sacred for all this stuff, it's great.

My idea was that each transform would call the same get_params function (instead of what is done today, defining it in each transform class),
but I am not that sure, so I let you decide.

Where would you put that function? How does it know which parameters it needs to generate?

let us know how we can help

Thanks! I will.

@fepegar
Owner

fepegar commented Jul 9, 2020

I think that the new transform in #222 is a perfect example of why using a seed/random state would be more helpful. How do you reproduce that? Actually, how do you reproduce e.g. a RandomNoise transform? Even if you save the exact same statistics, the result will always differ if the random state is different.

Maybe the best solution is adding a flag to ask the user if they'd like the random state to be saved.
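A small demonstration of the RandomNoise point, assuming only the std were saved (illustrative only, not the TorchIO implementation):

import torch

std = 0.1
noise_a = std * torch.randn(4)
noise_b = std * torch.randn(4)
print(torch.equal(noise_a, noise_b))  # False: same statistics, different noise

torch.manual_seed(1234)
noise_c = std * torch.randn(4)
torch.manual_seed(1234)
noise_d = std * torch.randn(4)
print(torch.equal(noise_c, noise_d))  # True: re-seeding reproduces the noise exactly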

@fepegar fepegar added the enhancement New feature or request label Jul 9, 2020
@GReguig
Contributor

GReguig commented Jul 9, 2020

How about randomly generating a seed for a given transform (and saving it) and setting it specifically for this transform (and the given subject)? This would only require saving the generated seed and resetting it at the beginning of the transform.

I would imagine something like this (the key point would be to systematically reset the seed manually):

import torch
torch.manual_seed(25074390)


def gen_seed():
    """
    Random seed generator (kept small to avoid the manual_seed overflow)
    :return: a 1-element tensor containing the seed
    """
    return torch.randint(0, 10**8, (1,))


class RandomMultiply:

    def get_params(self):
        torch.manual_seed(seed=self.seed)
        random_number = torch.rand(1).item()
        return self.seed, random_number

    def __call__(self, x, seed=gen_seed()):
        # Note: the default seed is generated once, when the function is defined,
        # so an explicit seed must be passed to get a different one per call
        self.seed = seed
        seed, random_number = self.get_params()
        x.n = random_number
        random_params = {
            'seed': seed,
            'random_number': random_number,
        }
        x.history.append(random_params)
        return x


class Number:
    def __init__(self):
        self.n = 100
        self.history = []


def main():
    number = Number()
    num_iterations = 1000
    transform = RandomMultiply()
    results = []
    for _ in range(num_iterations):
        number = transform(number)
        results.append(number.n)

    print('I love this result:', results[600])
    random_params = number.history[600]
    print('The number was {random_number} and the seed was {seed}'.format(**random_params))

    new_number = Number()
    reproducer = RandomMultiply()
    reproduction = reproducer(new_number, seed=random_params['seed'])
    print('\nReproduced result:', reproduction.n)
    print('Random parameters:', reproduction.history[0])
    torch.manual_seed(reproduction.history[0]["seed"])
    print('A reproducible random number:', torch.rand(1).item())


if __name__ == "__main__":
    main()

Would it work ?

@romainVala
Contributor Author

I think that the new transform in #222 is a perfect example of why using a seed/random state would be more helpful. How do you reproduce that? Actually, how do you reproduce e.g. a RandomNoise transform? Even if you save the exact same statistics, the result will always differ if the random state is different.

Maybe the best solution is adding a flag to ask the user if they'd like the random state to be saved.

OK, you make a point there.
I always had in mind deterministic transforms (motion, bias, spatial deformation) ... but you're right, when it comes to random noise, it may be interesting to have the exact same random selection.

But it may not be that critical to reproduce the same transform with different random values, if the std (and mean) are the same.
I guess it depends on the exact application in mind ...

"Double" feels arbitrary, and for e.g. the coarse field in RandomElasticDeformation it would be smaller to save the RNG state than the parameters.

This is another point in your favor, although of less importance.

@romainVala
Contributor Author

romainVala commented Jul 10, 2020

Where would you put that function? How does it know which parameters it needs to generate?

I guess I would put it in the abstract Transform class (all transforms inherit from it, right?).
I imagine as input a list of dicts with the fields:
"name" (variable name used to save in the history)
"type" (int, float, ...)
"shape" (numpy shape)
"range" (range of the random number)
"random_type" (uniform, Gaussian, etc.)

This may be a good addition of this code refactoring. Today each transform uses uniform sampling, but I am sure there are other interesting ways, and here we would have to code it only once ... so that it can be used in any transform ...
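A hypothetical sketch of the generic sampler described above (none of these names exist in TorchIO; the spec fields follow the list in this comment):

import torch

def sample_params(specs, history):
    """Draw parameters from a list of spec dicts and log them to a history list."""
    params = {}
    for spec in specs:
        low, high = spec['range']
        if spec.get('random_type', 'uniform') == 'uniform':
            value = low + (high - low) * torch.rand(spec['shape'])
        else:  # 'gaussian': interpret the range as (mean, std)
            value = low + high * torch.randn(spec['shape'])
        if spec['type'] == 'int':
            value = value.round().long()
        params[spec['name']] = value
    history.append(params)  # logging happens in one place, for every transform
    return params

history = []
specs = [{'name': 'degrees', 'type': 'float', 'shape': (3,), 'range': (-10, 10)}]
angles = sample_params(specs, history)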

@fepegar
Owner

fepegar commented Jul 10, 2020

I guess I would put it in the abstract Transform class (all transforms inherit from it, right?).
I imagine as input a list of dicts with the fields:
"name" (variable name used to save in the history)
"type" (int, float, ...)
"shape" (numpy shape)
"range" (range of the random number)
"random_type" (uniform, Gaussian, etc.)

I really don't see the point of this. It looks like a job for method inheritance, which is the way it works now.

@fepegar
Owner

fepegar commented Jul 10, 2020

@romainVala it looks like I edited your comment #208 (comment) when I tried to reply to it! Sorry about that :D

@fepegar
Owner

fepegar commented Jul 10, 2020

OK, you make a point there.
I always had in mind deterministic transforms (motion, bias, spatial deformation) ... but you're right, when it comes to random noise, it may be interesting to have the exact same random selection.

But it may not be that critical to reproduce the same transform with different random values, if the std (and mean) are the same.
I guess it depends on the exact application in mind ...

I do think that if we want to make it reproducible, it needs to be 100% reproducible.

I think @GReguig's solution might work! I added an option to input a manual seed at the beginning of the script and got this:

$ for i in {,,99,99}; do python g.py $i; done

No initial seed
I love this result: 0.7315794825553894
The number was 0.7315794825553894 and the seed was tensor([90200970])

Reproduced result: 0.7315794825553894
Random parameters: {'seed': tensor([90200970]), 'random_number': 0.7315794825553894}
A reproducible random number: 0.7315794825553894






No initial seed
I love this result: 0.8289129137992859
The number was 0.8289129137992859 and the seed was tensor([17905163])

Reproduced result: 0.8289129137992859
Random parameters: {'seed': tensor([17905163]), 'random_number': 0.8289129137992859}
A reproducible random number: 0.8289129137992859






Initial seed: 99
I love this result: 0.30399465560913086
The number was 0.30399465560913086 and the seed was tensor([87414401])

Reproduced result: 0.30399465560913086
Random parameters: {'seed': tensor([87414401]), 'random_number': 0.30399465560913086}
A reproducible random number: 0.30399465560913086






Initial seed: 99
I love this result: 0.30399465560913086
The number was 0.30399465560913086 and the seed was tensor([87414401])

Reproduced result: 0.30399465560913086
Random parameters: {'seed': tensor([87414401]), 'random_number': 0.30399465560913086}
A reproducible random number: 0.30399465560913086

@fepegar
Owner

fepegar commented Jul 10, 2020

Although I don't think it's a good idea to add an argument to __call__...

@fepegar
Owner

fepegar commented Jul 10, 2020

Although I don't think it's a good idea to add an argument to __call__...

I guess that the point of this stuff is to reproduce a single transform once, so it wouldn't be that bad.

@fepegar
Owner

fepegar commented Jul 12, 2020

I've implemented in #226 a solution similar to what @GReguig proposed. The seed is now passed when calling, not when instantiating. It took me a while, but I've managed to add the transforms history as a string, so it can be propagated even to the patches sampled by the Queue. I settled on saving the seed only, and only for random transforms. That should be enough to reproduce anything exactly.

import torchio as tio
colin = tio.datasets.Colin27()
transform = tio.RandomAffine()
transformed = transform(colin)
transform_class, seed = transformed.get_applied_transforms()[-1]
new_transform = transform_class()
new_transformed = new_transform(colin, seed=seed)

In that example, transformed and new_transformed should contain the same data.
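A rough way to verify that, reusing the get_images / DATA access pattern from the snippet at the top of this issue (hedged: the exact Subject API depends on the TorchIO version, and the images are assumed to come back in the same order):

import torch
from torchio import DATA

pairs = zip(
    transformed.get_images(intensity_only=False),
    new_transformed.get_images(intensity_only=False),
)
for old_image, new_image in pairs:
    assert torch.allclose(old_image[DATA], new_image[DATA])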

@romainVala
Contributor Author

romainVala commented Jul 13, 2020

Sounds good!
I am trying to understand your PR; maybe I am missing something: are the random parameters no longer saved in the history (only the seed)?
Or is it done implicitly in the __call__ function of RandomTransform?

@fepegar
Owner

fepegar commented Jul 13, 2020

That's correct, only the seed is saved at the moment. I've been thinking about how to retrieve the parameters. Maybe they can be stored in self when __call__ is called, so they can be retrieved afterwards.

@romainVala
Contributor Author

Please, no retrieval after the fact ...
(In some use cases, I use a CNN to predict the random noise std (as it is applied in the RandomNoise transform), so it is very convenient to have it readily available in the subject history ...)

@fepegar
Owner

fepegar commented Jul 13, 2020

What about a method in the transform that returns the parameters given a seed? That way you wouldn't need to apply it to get them.

seed = [get seed from sample history]
parameters = transform.get_parameters_from_seed(seed)

@romainVala
Contributor Author

I guess I would put it in the abstract Transform class (all transforms inherit from it, right?).
I imagine as input a list of dicts with the fields:
"name" (variable name used to save in the history)
"type" (int, float, ...)
"shape" (numpy shape)
"range" (range of the random number)
"random_type" (uniform, Gaussian, etc.)

I really don't see the point of this. It looks like a job for method inheritance, which is the way it works now.

OK, sorry if I am not clear enough (or if it is a bad idea).
So let me try to explain it again, because I think it can also give a solution for exposing the parameters explicitly.

Instead of having each transform define its own get_params function, why not have a common function to do it (which each transform would call)?
Doing so would have the advantage that saving the chosen parameters to the history is programmed only once (instead of specifically in each transform, as it was done before your PR).
The same, unique code would also allow adding options for the type of randomness (instead of choosing the transform parameters with a uniform distribution, one could choose a normal distribution, or anything else ...).
I agree that this requirement (changing the random distribution) should go in another issue, since it is quite decorrelated from this one,

but it is then an extra argument for using a generic get_params function ...
Does it make sense?

@romainVala
Contributor Author

What about a method in the transform that returns the parameters given a seed? That way you wouldn't need to apply it to get them.

seed = [get seed from sample history]
parameters = transform.get_parameters_from_seed(seed)

I prefer the previous way, where the parameters were directly included in the history,
but I guess that using this get_parameters_from_seed I could easily reprogram it.

Maybe it is more "user friendly" if the specific behavior (saving the parameters and/or the seed) could be a user choice ...

@fepegar
Owner

fepegar commented Jul 13, 2020

Saving all the parameters is quite problematic. As we said before, to get fully reproducible results, the whole image would need to be saved for e.g. RandomNoise, which is not feasible. Saving only the mean and std is a middle-ground solution that doesn't make things fully reproducible.

Saving a large number of parameters for e.g. RandomElasticDeformation is complicated as they would need to be converted into text so that they can later be collated in the batches. This would probably mean that some precision would be lost in conversion, and new functions would need to be written to convert from one type to the other.

I think we are the library in the field that allows for the most reproducibility with the current system. I think this should be enough, especially because it's a niche (although important) feature that most users will never care about.

@romainVala
Contributor Author

Yes, we discussed the difficulty of saving a large number of parameters, and this was the reason why you removed them from the sample dictionary (with the argument that it would make the torch collate job too complicated, although it was working ...).
And now the same argument is used to remove them from the history (I guess I will "lose" again, although there is no real problem with having them in the history).

I think there are 2 distinct objectives that are overlapping in this discussion:
1_ having a way to reproduce any given transform
2_ needing to log / access each transform's parameters for further use (for instance to use them to train a regression task, or just to log them to be able to review (check) a given experiment, or to study the correlation of prediction bias with any transform's parameters)

I thought it was more convenient to have the 2_ functionality directly in torchio (as a user option), but having a way to program it outside torchio is fine with me (get_parameters_from_seed should make it work).

I agree reproducibility is important (but rarely used), but I am not sure most users will never care about getting the transform parameters. My point 2_ is not about reproducibility; it is needed to look at transform-induced bias in model predictions. (OK, nobody is doing that today, they only want the data augmentation stuff and the deep model will solve it for them, but I think it is important for better understanding model generalization ...)

By the way, is having the option to control different random distributions for the parameters (normal, uniform, Poisson, etc.) worth a new issue?

@fepegar
Owner

fepegar commented Jul 13, 2020

Yes, we discussed the difficulty of saving a large number of parameters, and this was the reason why you removed them from the sample dictionary (with the argument that it would make the torch collate job too complicated, although it was working ...).

The parameters were not being saved in the batch! That has been added in #226.

I think there are 2 distinct objectives that are overlapping in this discussion:
1_ having a way to reproduce any given transform
2_ needing to log / access each transform's parameters for further use (for instance to use them to train a regression task, or just to log them to be able to review (check) a given experiment, or to study the correlation of prediction bias with any transform's parameters)
I thought it was more convenient to have the 2_ functionality directly in torchio (as a user option), but having a way to program it outside torchio is fine with me (get_parameters_from_seed should make it work).
I agree reproducibility is important (but rarely used), but I am not sure most users will never care about getting the transform parameters. My point 2_ is not about reproducibility; it is needed to look at transform-induced bias in model predictions. (OK, nobody is doing that today, they only want the data augmentation stuff and the deep model will solve it for them, but I think it is important for better understanding model generalization ...)

I think that the current implementation is a good compromise between feasibility in terms of how much information must be stored to reproduce/extract parameters and amount of work for the user.

By the way, is having the option to control different random distributions for the parameters (normal, uniform, Poisson, etc.) worth a new issue?

Yes, please.

@romainVala
Contributor Author

Concerning my comment on your PR,
#226 (comment)

maybe it is better to discuss it here (since there is still some confusion (or disagreement ...) about the main objective).

The reproducibility goal is not only to reproduce the random parameters of a given transform, but to be able (from a saved history) to reproduce all the transforms applied to the image.

As it was done before, I could read from the history the order of the transforms that were applied (I was just missing some "basic" transforms, because no information was added to the history (for instance for RescaleIntensity), but this could be solved by adding the input parameters with
sample.add_transform).

So I think we should keep add_transform as it was before, and just add an extra key to handle the seeding.

Do you think we can mix both ways?

If not, how can I retrieve that a RescaleIntensity has been applied with a given percentile ...?
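A hypothetical illustration of the kind of history being asked for here (this is not the actual #226 format): every transform, random or not, records its name and arguments.

# hypothetical history content, for illustration only
history = [
    ('RescaleIntensity', {'out_min_max': (0, 1), 'percentiles': (0.5, 99.5)}),
    ('RandomAffine', {'seed': 11325910110281226277}),
]
for name, params in history:
    print(name, params)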

@fepegar
Owner

fepegar commented Jul 21, 2020

So I've been thinking of something like this in RandomTransform:

    def get_params_from_seed(self, seed: Optional[int] = None):
        # Note: a seed value is expected here; torch.manual_seed would fail on None
        with torch.random.fork_rng():  # the global RNG state is restored on exit
            torch.manual_seed(seed)
            params = self.get_params()  # pylint: disable=no-member
        return params

But as I said in #226, sometimes you have different parameters for each image in the sample. Using 4D images (#238) will make this even more complex, but I guess that's a different issue.
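A hypothetical shape for such a per-image record (not the #226 implementation): one seed per image key, so each image's parameters could be re-drawn independently.

# hypothetical per-image history entry, for illustration only
entry = ('RandomNoise', {'t1': 9140603397059799205, 't2': 6129670197736294978})
transform_name, per_image_seeds = entry
for image_name, seed in per_image_seeds.items():
    print(image_name, seed)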

@romainVala
Contributor Author

Sounds good.
If, as suggested here #226 (comment), we do save a seed for each image in the sample, then it is OK too.
