Early September Improvements + Fixes #95
base: main
Conversation
This looks great, thanks also for adding in some more comments / docstrings. Could you write a quick check that multi-GPU LLC estimation works? It doesn't need to converge to any known value; I'd just like to have some indication when someone inevitably breaks this in the future.
Sure, I'll write a test!
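A minimal sketch of what such a check could look like, reusing the `get_stats` helper from this test file (the two-GPU skip guard and the finiteness assertion are assumptions about the intent, not code from this PR):

```python
import numpy as np
import pytest
import torch


@pytest.mark.gpu
def test_multi_gpu_llc_estimation_runs():
    # Sketch: get_stats is the helper defined in this test file.
    # How chains are spread across devices is assumed, not specified here.
    if torch.cuda.device_count() < 2:
        pytest.skip("needs at least 2 GPUs")
    stats = get_stats("cuda", seed=100)
    trace = np.asarray(stats["llc/trace"])
    # No convergence target: just assert the sampler produces a finite trace,
    # so a future multi-GPU regression fails loudly instead of silently.
    assert np.all(np.isfinite(trace))
```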
tests/integration/test_sampling.py
Outdated
```python
def distance(s1, s2):
    assert s1.keys() == s2.keys(), f"Expected the same keys in both stats, got {s1.keys()} and {s2.keys()}."
    assert s1["llc/trace"].shape == s2["llc/trace"].shape, f"Expected the same shape for llc/trace, got {s1['llc/trace'].shape} and {s2['llc/trace'].shape}."
    return np.mean((s1["llc/trace"] - s2["llc/trace"])**2)
```
I'd rather you use standard numpy or (preferred for tensors) torch closeness checks, such as torch.isclose or numpy.isclose. Also, this checks a distance averaged across every step of the chain; why choose that over a per-step distance? (The latter is more sensitive to small changes, so it might be preferred for this test IMO, though I don't feel strongly about that.)
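For concreteness, a per-step version might look something like this (a sketch only; the tolerances are placeholders):

```python
import numpy as np


def assert_traces_close(s1, s2, rtol=1e-5, atol=1e-8):
    # Compare the chains step by step instead of averaging an error over the
    # whole trace, so a late divergence can't be washed out by early agreement.
    t1, t2 = np.asarray(s1["llc/trace"]), np.asarray(s2["llc/trace"])
    assert t1.shape == t2.shape, f"Shape mismatch: {t1.shape} vs {t2.shape}"
    mismatch = ~np.isclose(t1, t2, rtol=rtol, atol=atol)
    assert not mismatch.any(), f"Traces diverge at steps {np.argwhere(mismatch)}"
```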
You're right, isclose or allclose would be better, especially for a per-step distance. My rationale for looking at the sampling chain is to check whether the chains diverge, but averaging the errors across steps makes that less effective. Let me fix that real quick:
tests/integration/test_sampling.py
Outdated
```python
sampling_method=SGLD,
optimizer_kwargs=dict(lr=4e-4, localization=100.0),
num_chains=chains,  # How many independent chains to run
num_draws=200,  # How many samples to draw per chain
```
I don't think 200 draws is necessary; using 10 draws should still allow the same checks (at lower distances) while being much more computationally efficient.
```python
def gpu_default():
    return get_stats("cuda", seed=100)


def test_cpu_consistent(cpu_default):
```
Please add @pytest.mark.gpu or @pytest.mark.slow where appropriate; this allows our (monorepo) GitHub CI/CD to ignore slow / inapplicable tests.
My criteria for slowness are vague; let's use >20s as slow, unless we think a test is crucial, in which case I'd be happy to go up to 1m. My criteria for GPU-ness are pretty clear :)
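For example (standard pytest markers; the test names below are placeholders, and the marker names would still need to be registered in the monorepo's pytest configuration):

```python
import pytest


@pytest.mark.gpu  # deselected on CI runners without a GPU
def test_gpu_consistent(gpu_default):
    ...


@pytest.mark.slow  # >20s wall time; deselect with `pytest -m "not slow"`
def test_cpu_consistent_long_chain(cpu_default):
    ...
```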
Thanks for the feedback! Let me know if any more issues with this PR need to be resolved. Do our GitHub Actions runners have more than one core?
…a returning a different nbeta depending on the dataloader's batch size.
Additions: