
Fix: Preserve input dtype in Dropout layer output #1323

Merged: 3 commits into ml-explore:main from the dropout_bug branch on Aug 13, 2024

Conversation

bhargavyagnik
Contributor

Proposed changes

I re-ordered the expression that computes the Dropout output so that it preserves the input dtype. The underlying problem is described in issue #1321 (see #1321 (comment)).
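A minimal sketch of the dtype behavior in question (assuming MLX's promotion rules, where combining a Python float scalar with a boolean mask array yields float32, while dividing or multiplying a bfloat16 array by a Python scalar keeps bfloat16):

import mlx.core as mx

x = mx.random.uniform(shape=(4, 4)).astype(mx.bfloat16)
p_1 = 0.5  # keep probability (1 - p)
mask = mx.random.bernoulli(p_1, x.shape)  # boolean mask

# Old ordering: the Python scalar (1 / p_1) combines with the bool mask first,
# producing a float32 array that then promotes the bfloat16 input to float32.
old = (1 / p_1) * mask * x
print(old.dtype)  # expected: float32

# Re-ordered: mask * x stays bfloat16, and dividing by a Python scalar
# preserves the array's dtype.
new = (mask * x) / p_1
print(new.dtype)  # expected: bfloat16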

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

- Modified Dropout implementation to ensure that the output dtype matches the input dtype.
- This resolves the issue ml-explore#1321
@awni (Member) commented Aug 13, 2024

Thanks for the fix! Two comments:

@awni (Member) commented Aug 13, 2024

> I think the correct function to use is assertEqual, or we can use assertTrue(x == y).

Those are indeed bugs. The right function is assertEqual. It would be great if you could send a PR to fix it.
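For context on the pitfall: with two arguments, unittest's assertTrue treats the second argument as the failure message, so assertTrue(a, b) passes whenever a is truthy and never compares a with b. A hypothetical illustration (not the actual test_nn.py code):

import unittest
import mlx.core as mx

class DtypeChecks(unittest.TestCase):
    def test_dtype(self):
        y = mx.zeros((2, 2), dtype=mx.float32)
        # Passes silently: the second argument is only the failure message,
        # so no comparison with mx.bfloat16 happens at all.
        self.assertTrue(y.dtype, mx.bfloat16)
        # Actually compares the two values.
        self.assertEqual(y.dtype, mx.float32)

if __name__ == "__main__":
    unittest.main()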

@bhargavyagnik
Contributor Author

Yeah, I'll add them in this PR itself. There were fewer than I anticipated.

- Revised test cases to align with updated dropout code
- Fixed assertion method: replaced self.assertTrue with self.assertEqual for accurate comparisons in test_nn.py -> test_rope, test_alibi, and test_dropout
@bhargavyagnik
Contributor Author

I added the changes; let me know if you have any suggestions. Thanks for the review!

@@ -32,7 +32,7 @@ def __call__(self, x):

         mask = mx.random.bernoulli(self._p_1, x.shape)

-        return (1 / self._p_1) * mask * x
+        return (mask * x) / self._p_1
Member

Sorry I missed this. You should still do the scaling with a multiplication by (1 / self._p_1), which will be a single divide followed by a bunch of multiplies, as opposed to a bunch of divides. Divide is more expensive than multiply.
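A sketch of the ordering being suggested (an illustration of the idea, not necessarily the exact merged code): compute the reciprocal of the keep probability once as a Python scalar, then scale with multiplication, which still preserves the input dtype while replacing per-element divides with multiplies.

import mlx.core as mx

def dropout_sketch(x, p_1):
    # p_1 is the keep probability (1 - p), a Python float as in the layer.
    mask = mx.random.bernoulli(p_1, x.shape)
    # One scalar divide up front, then only elementwise multiplies.
    # Multiplying (mask * x) by a Python scalar keeps x's dtype.
    return (mask * x) * (1 / p_1)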

Contributor Author

Oh, okay, I'll make that change.

Contributor Author

Hi @awni
Sorry, I'm still learning, so please excuse me if this is off topic and a very minor question, but...

import mlx.core as mx
import timeit

x = mx.random.uniform(shape=(100000, 1000))
y = mx.random.uniform(shape=(100000, 1000))

x = x * 124.215**6
y = y * 135.353**6
def direct_division():
    return mx.eval(x/y)

def multiply_reciprocal():
    return mx.eval(x * (1/y))

print("Direct division:", timeit.timeit(direct_division, number=1000))
print("Multiply reciprocal:", timeit.timeit(multiply_reciprocal, number=1000))

The output is:
Direct division: 6.718534375000672
Multiply reciprocal: 10.876147124999989

I read some articles on this, e.g. https://stackoverflow.com/questions/1117674/why-is-multiplying-cheaper-than-dividing?rq=3 and https://stackoverflow.com/questions/15745819/why-is-division-more-expensive-than-multiplication.

I understand that division is more expensive than multiplication, but when I observe the performance, direct_division is faster than multiply_reciprocal.
I also experimented with multiplying x and y by 1.7976931348623157e+308, but direct_division still seems faster. What could be the underlying reason?

Member

No problem. In your example, y is an array, so you are doing a whole array of divides in both cases. The second one is strictly more work than the first (in fact it includes the first).

In the Dropout example, p is a scalar, so a closer benchmark would be:

import mlx.core as mx
import timeit

x = mx.random.uniform(shape=(100000, 1000))
y = 5.0

def direct_division():
    return mx.eval(x/y)

def multiply_reciprocal():
    return mx.eval(x * (1/y))

print("Direct division:", timeit.timeit(direct_division, number=1000))
print("Multiply reciprocal:", timeit.timeit(multiply_reciprocal, number=1000))

Also, you probably still won't see a time difference because both operations are memory-bandwidth bound. But if you measure power use (which isn't so easy to do reliably), you might see lower power consumption for the second. It could be marginal, but in this case why not, since it's trivial to use the slightly more efficient version.

Contributor Author

Yeah, I see. I was just exploring, so I thought I'd test it out. Thank you so much for the quick reply!

Comment on lines +1002 to +1003
self.assertEqual(y.shape, x.shape)
self.assertEqual(y.dtype, mx.bfloat16)
Member

Thank you for fixing those!!

@awni (Member) left a review

Looks great. Will merge when the tests clear.

@awni awni merged commit a098bc9 into ml-explore:main Aug 13, 2024
5 checks passed
@bhargavyagnik bhargavyagnik deleted the dropout_bug branch August 13, 2024 18:57