Add Mamba Architecture #1157
Conversation
This seems to train well without parallelism, but I'm having bugs in a conversion script I wrote (gibberish output). I'll be checking for differences in output between a single instantiated layer from this PR versus the reference implementation.
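A sanity check along those lines might look like the following sketch. `Mamba` is the reference block from `state-spaces/mamba`; `build_neox_mamba_layer` is hypothetical and stands in for constructing this PR's layer from converted weights:

```python
# Sketch of a per-layer parity check between a converted layer and the
# reference implementation. `build_neox_mamba_layer` is hypothetical.
import torch
from mamba_ssm import Mamba  # reference block from state-spaces/mamba

def max_abs_diff(layer_a, layer_b, d_model=768, seq_len=128):
    """Feed identical input to both layers and report worst-case deviation."""
    x = torch.randn(1, seq_len, d_model, device="cuda")
    with torch.no_grad():
        return (layer_a(x) - layer_b(x)).abs().max().item()

ref_layer = Mamba(d_model=768).cuda()
# neox_layer = build_neox_mamba_layer(ref_layer.state_dict())  # hypothetical
# print(max_abs_diff(ref_layer, neox_layer))  # gibberish output => large diff
```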
Worked after state-spaces/mamba#211! Got performance from a 160M (trained with the Pythia config, untied embed + unembed) on par with the Mamba-130m results in the paper! I'll clean up the code slightly and add sample configs, then mark this ready for review. This also pairs with a DeeperSpeed PR I'll make that should allow holding specified parameters in fp32 despite DeepSpeed trying to cast everything to 16-bit. I want to look at adding tensor parallelism for Mamba too, but will do that later.
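For the fp32 piece, a rough sketch of the idea (the attribute name below is made up; the actual mechanism lives in the DeeperSpeed PR): the numerically sensitive SSM parameters, `A_log` and `D` in the reference block, get tagged so a patched engine can skip downcasting them.

```python
import torch

def tag_mamba_fp32_params(model: torch.nn.Module):
    """Mark SSM state parameters so a (hypothetical) patched DeepSpeed engine
    leaves them in fp32 instead of casting them to fp16/bf16."""
    for name, param in model.named_parameters():
        if name.endswith("A_log") or name.endswith(".D"):
            # Placeholder attribute, not a real DeepSpeed API.
            param._keep_in_fp32 = True
```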
Awesome! Great work. I have some TP ideas that we can discuss on Discord.
Ready for initial review! This pairs with EleutherAI/DeeperSpeed#61, on which I'd appreciate feedback as to whether the approach there is acceptable.
Note for the future: in addition to the […] (e.g. […]), one drawback of this strategy is that we're adding a lot of annoying bookkeeping for the user. It's already a bit confusing that we're putting the attention-free Mamba blocks under `attention_config`.
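For reference, NeoX selects per-layer block types via `attention_config`; a hypothetical all-Mamba config might look like the snippet below (the values are illustrative, not necessarily what this PR ships):

```yml
{
  # Illustrative only: selects the attention-free Mamba block for all 12
  # layers via the attention_config mechanism discussed above.
  "num_layers": 12,
  "attention_config": [[["mamba"], 12]],
}
```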
In the future, we should add support for the Triton RMSNorm kernel introduced by Mamba. Noting here and adding a TODO for later. https://github.com/state-spaces/mamba/blob/v1.2.0/mamba_ssm/ops/triton/layernorm.py
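For reference, this is the unfused computation that kernel implements (standard RMSNorm; the Triton version additionally supports fusing a residual add):

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Unfused RMSNorm: scale by the reciprocal root-mean-square of the last dim."""
    rstd = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rstd * weight
```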
closes #1148
This PR adds Mamba to NeoX, along with flags for turning on/off the selective scan + conv1d + full mamba_inner_fn kernels.
For now, this does not support parallelism, but I want to investigate adding tensor parallelism to it.
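As an illustration of what such kernel flags toggle (the flag and helper names below are placeholders, not this PR's actual config keys), the fused causal-conv1d path versus its reference fallback might be dispatched like this:

```python
import torch
import torch.nn.functional as F

try:
    from causal_conv1d import causal_conv1d_fn  # fused CUDA kernel
except ImportError:
    causal_conv1d_fn = None

def conv1d_step(x, weight, bias, use_fused_kernel=True):
    """x: (batch, dim, seqlen); weight: (dim, width). Depthwise causal conv + SiLU."""
    if use_fused_kernel and causal_conv1d_fn is not None:
        return causal_conv1d_fn(x, weight, bias, activation="silu")
    # Reference fallback: same math as the kernel, just unfused and slower.
    seqlen, dim = x.shape[-1], x.shape[1]
    y = F.conv1d(x, weight.unsqueeze(1), bias, padding=weight.shape[-1] - 1, groups=dim)
    return F.silu(y[..., :seqlen])
```

The selective-scan flag would presumably gate the fused `selective_scan_fn` against a pure-PyTorch reference path in the same fashion.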