
Early September Improvements + Fixes #95

Open · wants to merge 11 commits into main
Conversation

@wz-ml (Collaborator) commented Sep 5, 2024

Additions:

  • Fixes issue "Seeding init_loss calculations" #94.
  • Multi-GPU support:
    • Multi-GPU support with one core per GPU.
    • Support for multiple cores per GPU.
    • Verify that sampling chains match.
    • Multi-GPU test.
  • Batch accumulation bugfix (default_nbeta):
    • Batch accumulation test.
  • Sampling optimizations:
    • Automatic mixed precision (see the sketch after this list).
      • AMP tests.
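
For illustration, this is roughly what the mixed-precision path looks like in PyTorch; it is a sketch rather than the exact code in this PR, and model, loss_fn, and optimizer are placeholder names:

import torch

def amp_loss_step(model, xs, ys, optimizer, scaler, loss_fn):
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass and loss in reduced precision where it is safe to do so.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(xs), ys)
    # Scale the loss so small fp16 gradients do not underflow; the scaler
    # unscales the gradients before the optimizer applies its update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()

# scaler = torch.cuda.amp.GradScaler() is created once, outside the sampling loop.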

@wz-ml wz-ml self-assigned this Sep 5, 2024
@svwingerden svwingerden changed the base branch from main to rfc-for-monorepo September 5, 2024 21:56
@svwingerden svwingerden changed the base branch from rfc-for-monorepo to main September 5, 2024 21:56
@svwingerden (Collaborator)

This looks great; thanks also for adding in some more comments/docstrings. Could you write a quick check that multi-GPU LLC estimation works? It doesn't need to converge to any known value; I'd just like to have some indication when someone inevitably breaks this in the future.

@wz-ml (Collaborator, Author) commented Sep 5, 2024

Sure, I'll write a test!

import numpy as np

def distance(s1, s2):
    assert s1.keys() == s2.keys(), f"Expected the same keys in both stats, got {s1.keys()} and {s2.keys()}."
    assert s1["llc/trace"].shape == s2["llc/trace"].shape, f"Expected the same shape for llc/trace, got {s1['llc/trace'].shape} and {s2['llc/trace'].shape}."
    return np.mean((s1["llc/trace"] - s2["llc/trace"]) ** 2)
Collaborator

I'd rather you use standard numpy or (preferred for tensors) torch closeness checks, such as torch.isclose or numpy.isclose. Also, this checks the distance averaged across every step of the chain; why choose that over a per-step distance? (The latter is more sensitive to small changes, so it might be preferred for this test IMO, though I don't feel strongly about that.)

Collaborator Author

You're right, isclose or allclose would be better, especially as a per-step distance. My rationale for looking at the sampling chain was to check whether the chains diverge, but averaging the errors across steps makes that less effective. Let me fix that real quick:
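
Something along these lines (sketch only; the helper name and tolerances are illustrative, not the final test):

import numpy as np

def assert_traces_close(s1, s2, rtol=1e-5, atol=1e-8):
    # Compare the llc/trace arrays element-wise (per chain, per step) instead of
    # averaging the squared error over the whole chain.
    t1, t2 = np.asarray(s1["llc/trace"]), np.asarray(s2["llc/trace"])
    assert t1.shape == t2.shape, f"Expected the same shape for llc/trace, got {t1.shape} and {t2.shape}."
    assert np.allclose(t1, t2, rtol=rtol, atol=atol), "Sampling chains diverged beyond tolerance."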

sampling_method=SGLD,
optimizer_kwargs=dict(lr=4e-4, localization=100.0),
num_chains=chains, # How many independent chains to run
num_draws=200, # How many samples to draw per chain
@svwingerden (Collaborator) commented Sep 6, 2024

I don't think 200 draws is necessary; using 10 draws should still allow the same checks (at lower distances) while being much more computationally efficient.

def gpu_default():
    return get_stats("cuda", seed=100)

def test_cpu_consistent(cpu_default):
@svwingerden (Collaborator) commented Sep 6, 2024

Please add @pytest.mark.gpu or @pytest.mark.slow where appropriate; this allows our (monorepo) GitHub CI/CD to ignore slow or inapplicable tests.

Collaborator

My criteria for slowness are vague; let's use >20s as slow, unless we think a test is crucial, in which case I'd be happy with up to 1m. My criteria for GPU-ness are pretty clear :)
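
For reference, a minimal sketch of how the markers might be applied and registered; the test name and the pytest.ini contents are illustrative rather than the repo's actual configuration:

import pytest

@pytest.mark.gpu
@pytest.mark.slow
def test_multi_gpu_llc_estimation():
    ...

# The custom markers are registered once (e.g. in pytest.ini or pyproject.toml)
# so that CI can deselect them with `pytest -m "not gpu and not slow"`:
#
# [pytest]
# markers =
#     gpu: requires a CUDA-capable device
#     slow: takes longer than ~20 seconds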

@wz-ml (Collaborator, Author) commented Sep 6, 2024

Thanks for the feedback! Let me know if any more issues with this PR need to be resolved. Does our GitHub Actions runner have more than one core?
