
[WIP] RWKV4Neo the RNN and GPT Hybrid Model #20809

Closed
wants to merge 6 commits

Conversation

ArEnSc
Contributor

@ArEnSc ArEnSc commented Dec 17, 2022

What does this PR do?

Adds the model from the linked issue.
Fixes #20737

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@younesbelkada
@ArthurZucker

@ArEnSc ArEnSc marked this pull request as draft December 17, 2022 20:50
@younesbelkada
Contributor

younesbelkada commented Dec 19, 2022

Hi @ArEnSc !
Thanks for starting the PR over 💪
Let @ArthurZucker or me know whenever you need help!

@ArEnSc
Contributor Author

ArEnSc commented Dec 19, 2022

Hi @ArEnSc ! Thanks for starting the PR over 💪 Let @ArthurZucker or me know whenever you need help!

Will do. Still doing some research; I just figured out how the training notebook works, and the model executes in the notebook, so that's a positive.

@ArEnSc
Contributor Author

ArEnSc commented Jan 5, 2023

Update: I am tracing the model and have come up with a state-based API for the RNN inference mode in my own code base to experiment with.

@younesbelkada
Contributor

Thanks a lot for the status update! Feel free to ping whenever you need help

@xloem
Contributor

xloem commented Jan 16, 2023

Sometimes I look at working on this a little. Here are my notes and possible tasks, started 2023-01-16.

  • The template appears to be from a T5 style model. The RWKV state could be the encoder hidden state (a little intuitive) and/or the past key values (normative generation). It will take some algebra and tests to add input state to the GPT training form from the RNN inference form.

  • The TensorFlow loading code looks complicated to me. I might move it out to another file for now.

  • The embeddings can likely be adjusted to reflect parts "i" and "ii" of the high-level outline below.

  • It could be helpful to organize the file to retain layout similarity with blinkdl’s files.

  • For the outline below, the next step is reviewing TimeMix.
    Draft of architecture (maybe leave out optional parts to start); minimal Python sketches of the block layout and of the single-token TimeMix and ChannelMix recurrences follow this list.

    High level:

    1. word embeddings emb
    2. layernorm ln0
      - optional 2-axis trained position embeddings seen in training code for image modeling pos_emb_x pos_emb_y. this is converted to 1-axis pos_emb and used prior to ln0 in inference.
    3. layers of blocks
      1. layernorm ln1
      2. timemix self attention time_mix_k, time_mix_v, time_mix_r, time_first, time_decay, key, value, receptance, output. time_first and time_decay are kept as float32 in inference.
      3. layernorm ln2
      4. feedforward channelmix time_mix_k, time_mix_r, key, value, receptance (see channelmix section below)
      - timemix self attention optionally replaced with feedforward channelmix for block 0 in training code
      - for one optional block, tiny attention tiny_ln, tiny_q, tiny_k, tiny_v, tiny_mask seen in training code, inference code in development
      - optionally inference code uses what looks like a numeric stability trick to extract a factor of 2 from the weights every 6 layers
    4. layernorm ln_out
      - optional "copy" attention head_q, head_k, copy_mask then summed to head in training code, inference code in development
    5. linear language modeling head
      - for training loss, BlinkDL presently has a function applied after cross entropy, called L2Wrap, to reduce magnitudes

    GPT(training) and RNN (inference) equivalence:

    • i think special training initialization values may be used in timemix, channelmix
    • for inference time_decay = -exp(time_decay) is factored out when loaded, but for training this is done in the forward pass.
    • 5 state elements per layer:
      • 0 = ChannelMix/FF xx
      • 1 = TimeMix/SA xx
      • 2 = aa
      • 3 = bb
      • 4 = pp in inference, o in training

    TimeMix:

    1. the previous state is shifted into the x vector to make xx. in training this is done by "time shifting" with nn.ZeroPad2d((0, 0, 1, -1)); in single token inference it is passed as state element 1, which is then replaced by x.
    2. linear interpolation between the old state xx and the new state x, weighting x by a ratio of time_mix_k, time_mix_v, and time_mix_r to make xk, xv, and xr respectively.
    3. k = key @ xk
    4. v = value @ xv
    5. sr = sigmoid(receptance @ xr) # called simply r in inference code
    • the GPT training form of this is now handed off to a hand-written cuda kernel, compiled on first run, from cuda/wkv_cuda.cu
      • kernel parameters: B = batchsize; T = sequence length; C = channel count; _w = time_decay; _u = time_first; _k = k; _v = v; _y = wkv.
      • i think this used to be a convolution; i'm not sure whether it still is
      • o and no appear to be running values for magnitude management in exponential space, initialized to -1e38; p and q are initialized to 0
      • k and v are indexed by thread so the token offset may represent different subregions. i'm not quite clear on that and should test or ask.
      1. no = max(o, time_first[channel] + k[token])
      2. A = exp(o - no) # this is e1 in the RNN form
      3. B = exp(time_first[channel] + k[token] - no) # this is e2 in RNN
      4. wkv[token] = (A * p + B * v[token]) / (A * q + B)
      5. no = max(time_decay[channel] + o, k[token])
      6. A = exp(time_decay[channel] + o - no)
      7. B = exp(k[token] - no)
      8. p = A * p + B * v[token]
      9. q = A * q + B
      10. o = no; token += 1
    • ... here would be the remaining core algebra and code inspection
    • WIP unified summary of wkv kernel between inference and training (a single-token Python sketch of this recurrence appears after this list):
      1. ww = time_first + k[token]
      2. next_pp = max(pp, ww)
      3. A = exp(pp - next_pp ...
    • rwkv = sr * wkv
    • return output @ rwkv

    ChannelMix:

    1. the previous state is shifted into the x vector to make xx. in training this is done by "time shifting" with nn.ZeroPad2d((0, 0, 1, -1)); in single token inference it is passed as state element 0, which is then replaced by x.
    2. linear interpolation between the old state xx and the new state x, weighting x by a ratio of time_mix_k and time_mix_r to make xk and xr respectively (see the ChannelMix sketch after this list).
    3. r = sigmoid(receptance @ xr)
    4. k = square(relu(key @ xk))
    5. kv = value @ k
    6. rkv = r * kv
    7. return rkv
  • review or improve model file further
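
A minimal PyTorch sketch of the block layout from the high-level outline above, with the optional parts left out. The class names, default sizes, and the residual connections around each sub-block are assumptions for illustration; the TimeMix and ChannelMix sub-blocks are stubbed with nn.Identity here and sketched as single-token functions below.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    # One block of the outline: ln1 -> TimeMix (self attention), ln2 -> ChannelMix (feedforward).
    # The residual connections are assumed from BlinkDL's training code; the sub-blocks are placeholders.
    def __init__(self, n_embd):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.att = nn.Identity()  # TimeMix: time_mix_k/v/r, time_first, time_decay, key, value, receptance, output
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Identity()  # ChannelMix: time_mix_k/r, key, value, receptance

    def forward(self, x):
        x = x + self.att(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x


class RWKV(nn.Module):
    # High-level layout: emb -> ln0 -> blocks -> ln_out -> linear language modeling head.
    def __init__(self, vocab_size=100, n_embd=64, n_layer=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)                           # 1. word embeddings
        self.ln0 = nn.LayerNorm(n_embd)                                       # 2. layernorm ln0
        self.blocks = nn.ModuleList([Block(n_embd) for _ in range(n_layer)])  # 3. layers of blocks
        self.ln_out = nn.LayerNorm(n_embd)                                    # 4. layernorm ln_out
        self.head = nn.Linear(n_embd, vocab_size, bias=False)                 # 5. LM head

    def forward(self, idx):
        x = self.ln0(self.emb(idx))
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_out(x))


logits = RWKV()(torch.randint(0, 100, (1, 8)))  # (batch=1, seq=8) -> (1, 8, vocab_size)
```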
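
A minimal sketch of the single-token (RNN inference) form of TimeMix, following the 5-element state layout and the WIP unified wkv summary above. It assumes x is a 1-D vector of size n_embd, state is a (5, n_embd) tensor, square weight shapes as in the example call, and that time_decay has already been replaced by -exp(time_decay) at load time, as noted in the equivalence section.

```python
import torch


def time_mix(x, state, time_mix_k, time_mix_v, time_mix_r,
             time_first, time_decay, key, value, receptance, output):
    # state[1] = previous x for this sub-block, state[2] = aa, state[3] = bb, state[4] = pp.
    xx = state[1]                                # 1. previous state shifted in as xx ...
    xk = x * time_mix_k + xx * (1 - time_mix_k)  # 2. interpolate old and new state with time_mix_k
    xv = x * time_mix_v + xx * (1 - time_mix_v)  #    ... with time_mix_v
    xr = x * time_mix_r + xx * (1 - time_mix_r)  #    ... with time_mix_r
    state[1] = x                                 # 1. ... which is then replaced by x

    k = key @ xk                                 # 3. k
    v = value @ xv                               # 4. v
    sr = torch.sigmoid(receptance @ xr)          # 5. sr (called simply r in the inference code)

    aa, bb, pp = state[2], state[3], state[4]
    ww = time_first + k                          # 1. ww = time_first + k[token]
    qq = torch.maximum(pp, ww)                   # 2. next_pp = max(pp, ww)
    e1 = torch.exp(pp - qq)                      # 3. A / e1
    e2 = torch.exp(ww - qq)                      #    B / e2
    wkv = (e1 * aa + e2 * v) / (e1 * bb + e2)    # 4. wkv

    ww = pp + time_decay                         # 5.-10. update the running numerator/denominator
    qq = torch.maximum(ww, k)
    e1 = torch.exp(ww - qq)
    e2 = torch.exp(k - qq)
    state[2] = e1 * aa + e2 * v                  # aa
    state[3] = e1 * bb + e2                      # bb
    state[4] = qq                                # pp

    return output @ (sr * wkv), state            # rwkv = sr * wkv; return output @ rwkv


# Example call with random weights; state[4] (pp) starts very negative, as in the kernel notes above.
n = 8
state = torch.zeros(5, n)
state[4] = -1e38
out, state = time_mix(torch.randn(n), state,
                      torch.rand(n), torch.rand(n), torch.rand(n),
                      torch.randn(n), -torch.exp(torch.randn(n)),
                      torch.randn(n, n), torch.randn(n, n), torch.randn(n, n), torch.randn(n, n))
```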
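
A matching single-token sketch of ChannelMix, following steps 1-7 above. state_xx is state element 0 for the layer; the 4x hidden expansion used for the example weights is an assumption.

```python
import torch


def channel_mix(x, state_xx, time_mix_k, time_mix_r, key, value, receptance):
    xx = state_xx                                # 1. previous state shifted in (state element 0)
    xk = x * time_mix_k + xx * (1 - time_mix_k)  # 2. interpolate old and new state with time_mix_k
    xr = x * time_mix_r + xx * (1 - time_mix_r)  #    ... with time_mix_r
    r = torch.sigmoid(receptance @ xr)           # 3. r = sigmoid(receptance @ xr)
    k = torch.square(torch.relu(key @ xk))       # 4. k = square(relu(key @ xk))
    kv = value @ k                               # 5. kv = value @ k
    return r * kv, x                             # 6.-7. rkv, and x becomes the new state element 0


# Example call with random weights (n_embd = 8, hidden = 32).
n, h = 8, 32
out, new_xx = channel_mix(torch.randn(n), torch.zeros(n),
                          torch.rand(n), torch.rand(n),
                          torch.randn(h, n), torch.randn(n, h), torch.randn(n, n))
```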

@Lundez

Lundez commented Jan 17, 2023

@ArEnSc do you need any help?

@ArEnSc
Contributor Author

ArEnSc commented Jan 17, 2023

@ArEnSc do you need any help?

If you want to help, PM me on Discord! Otherwise I should have a minor update by the end of the week.

@younesbelkada
Contributor

younesbelkada commented Jan 23, 2023

Hi @ArEnSc,
Can you share with us your discord handle? Thanks!

@ArEnSc
Contributor Author

ArEnSc commented Jan 23, 2023

Hi @ArEnSc, Can you share with us your discord handle? Thanks!

ARENSC#5905
Yeah, still working on it haha, it will be a while.

@ArEnSc
Contributor Author

ArEnSc commented Jan 30, 2023

Working on having the GPT encoder generate the context, with RNN-mode inference sharing the weights.

@ArEnSc
Contributor Author

ArEnSc commented Jan 30, 2023

Deleted a bunch of unneeded stuff.

@huggingface huggingface deleted a comment from github-actions bot Mar 15, 2023
@ArthurZucker ArthurZucker changed the title RWKV4Neo the RNN and GPT Hybrid Model [WIP] RWKV4Neo the RNN and GPT Hybrid Model Mar 15, 2023
@ArthurZucker
Collaborator

Added the [WIP] label to prevent the bot from coming back 😉

@huggingface huggingface deleted a comment from github-actions bot Apr 11, 2023
@fblgit fblgit mentioned this pull request Apr 11, 2023
@sgugger
Collaborator

sgugger commented Apr 11, 2023

@ArEnSc Please let us know if you won't have time to finish this PR. The model is heavily requested, as you can see from the linked issue; do you want us to take over this PR and finish it?

@ArEnSc
Contributor Author

ArEnSc commented Apr 12, 2023

@ArEnSc Please let us know if you won't have time to finish this PR. The model is heavily requested, as you can see from the linked issue; do you want us to take over this PR and finish it?

Sure, yes. Sorry, I've been busy at the hospital these days! I think it's probably important that you guys take this on =)

@github-actions

github-actions bot commented May 6, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this May 14, 2023