Test-train split of physionet2012 is scientifically flawed. #409

Closed
2 tasks done
Twofigo opened this issue May 11, 2024 · 2 comments


Twofigo commented May 11, 2024

1. System Info

A simple, recent installation of PyPOTS, using scripts derived from BrewPOTS for some comparative model training.

2. Information

  • The official example scripts
  • My own created scripts

3. Reproduction

Use gene_physionet2012() for anything.
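For example, run the following twice in two separate processes (the import path follows this report; the returned key name is an assumption based on the BrewPOTS tutorials, so adjust to your version):

```python
import numpy as np

from pypots.data import gene_physionet2012  # import path as referenced in this report

# Run this script twice in two separate processes: the fingerprint of the
# training split differs between the runs, because the split is randomized
# without a fixed seed. The "train_X" key name is an assumption.
dataset = gene_physionet2012()
print(np.nansum(dataset["train_X"]))
```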

4. Expected behavior

The gene_physionet2012() function in pypots.data does a test-train split of the whole dataset, but with an uncontrolled random state. This means separate programs can produce different splits, which breaks the scientific utility of comparing results between runs. The function should accept a random seed so that the split is fixed between executions, as sketched below.
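For illustration, here is a minimal sketch of what a seedable split could look like; the helper name, its arguments, and the sklearn-based implementation are mine, not PyPOTS's actual API:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_physionet2012(X, y, test_size=0.2, random_state=None):
    # Hypothetical seedable split: passing the same random_state reproduces
    # the same partition across separate programs and runs.
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Two independent calls with the same seed yield identical splits.
X, y = np.arange(100).reshape(50, 2), np.arange(50)
X_tr1, X_te1, y_tr1, y_te1 = split_physionet2012(X, y, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = split_physionet2012(X, y, random_state=42)
assert (X_tr1 == X_tr2).all() and (X_te1 == X_te2).all()
```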

Furthermore, physionet2012 comes in three sections from the start, which tsdb merges into one pile. I originally assumed this was the split I was given when running gene_physionet2012(), but that appears to have been an incorrect assumption.

To be fully correct in my scientific methodology, I now need to rewrite both tsdb and gene_physionet2012, and rerun every experiment to this point with a fixed seed (at the minimum), or with the proper original split if I want to publish my results without a disclaimer.

To be clear, the examples in BrewPOTS do include a util function that sets the torch and numpy random seeds globally, but this is a terrible workaround. You can only trust the results when running the same code along the same code path: any other library that happens to call numpy.random before gene_physionet2012() advances the global random state, introducing a systematic flaw in results between code paths.
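A plain-NumPy illustration of the difference between global seeding and a dedicated generator:

```python
import numpy as np

# Fragile: a global seed only holds if nothing else touches the global
# state in between.
np.random.seed(42)
_ = np.random.random()           # some other library draws a number...
print(np.random.permutation(5))  # ...and the "seeded" split now differs

# Robust: a dedicated Generator is insulated from the global state, so
# stray numpy.random calls elsewhere cannot perturb the split.
rng = np.random.default_rng(42)
_ = np.random.random()           # irrelevant to rng
print(rng.permutation(5))        # the same for seed 42, every time
```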

Twofigo added the bug (Something isn't working) label on May 11, 2024
@WenjieDu (Owner)

Hi there 👋,

Thank you so much for your attention to PyPOTS! You can follow me on GitHub to receive the latest news about PyPOTS. If you find PyPOTS helpful to your work, please star ⭐️ this repository. Your star is your recognition, which can help more people notice PyPOTS and grow the PyPOTS community. It matters, and it is definitely a kind of contribution to the community.

I have received your message and will respond ASAP. Thank you for your patience! 😃

Best,
Wenjie

WenjieDu self-assigned this on May 14, 2024
@WenjieDu (Owner)

Hi @Twofigo, thank you for raising this discussion, but "scientifically flawed", seriously? The function just splits the dataset in a way that doesn't follow the official split. Here https://physionet.org/content/challenge-2012/1.0.0 the official doc says "Four thousand records comprise training set A, and the remaining records form test sets B and C". Hence, the three sets A, B, and C should be assumed to share the same data distribution, right? So the splitting isn't a problem. But we do thank you for your feedback, and I think that, in the result returned from the TSDB load function, we could include information about which set each sample is from, or add an argument to only return samples from set A/B/C.
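Something like the following; note the subset argument is only a sketch of that proposal and does not exist in TSDB today:

```python
import tsdb

# Proposed, not yet implemented: restrict loading to the official
# partitions instead of merging set-a/b/c into one pile.
train = tsdb.load("physionet_2012", subset="set-a")            # official training set A
test = tsdb.load("physionet_2012", subset=["set-b", "set-c"])  # official test sets B and C
```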

Let's talk about the random seed setting. Note that although a random seed can help, it is never a gold guarantee for reproducing your results, and there are many discussions on the net for you to refer to. The only way to ensure your generated dataset is the same everywhere (across dev environments, settings, and devices) is serializing it, namely saving it into a file. And we do provide such functions: try pypots.data.saving.save_dict_into_h5() or pypots.data.saving.pickle_dump(). PyPOTS models can also directly take saved h5 files for model training.
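For example (the dataset dict here is a stand-in for the split you generated, and the argument order is assumed, so please check the docs):

```python
import numpy as np

from pypots.data.saving import pickle_dump, save_dict_into_h5

dataset = {"train_X": np.random.random((8, 4, 3))}  # stand-in for the generated split

# Serialize the generated split once, then load the same file in every
# experiment, regardless of the code path that led there.
save_dict_into_h5(dataset, "physionet2012_split.h5")
pickle_dump(dataset, "physionet2012_split.pkl")

# PyPOTS models can also consume the saved h5 file directly for training,
# e.g. model.fit(train_set="physionet2012_split.h5").
```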

Of course the random state will change when you call numpy.random again. This is not an issue. What else would you expect, that everything stays intact? Then why call it again? All libraries and toolkits provide standard APIs to make people's lives easier, but that doesn't mean you can use them without understanding what they're doing.
