nanoRWKV with custom architecture

This is a custom architecture for the nanoRWKV project. It is based on the original nanoRWKV architecture, but with several modifications.

Model Structure

RWKV_TimeMix -> RWKV_ChannelMix -> Sliding Window Attention -> GroupedQAttention -> TinyMoE

Here is a brief description of each component (a minimal sketch of how a block chains them follows the list):

  1. RWKV_TimeMix: This component applies a time-based mixing operation to the input, which helps the model capture temporal dependencies.
  2. RWKV_ChannelMix: The channel-based mixing operation is performed in this module, allowing the model to learn better representations across different feature channels.
  3. Sliding Window Attention: This attention mechanism operates on a sliding window of the input, capturing local dependencies efficiently while the effective receptive field grows as layers stack.
  4. GroupedQAttention: This attention module groups the query, key, and value computations so that key/value heads are shared across groups of query heads, reducing the cost of multi-head attention.
  5. TinyMoE: The Tiny Mixture of Experts (TinyMoE) layer is a lightweight and efficient implementation of a Mixture of Experts (MoE) mechanism, which can help the model learn specialized representations.
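The following minimal sketch shows how one block might chain these five stages. The pre-norm residual wiring, the Block name, and the nn.Identity placeholders (standing in for the repository's actual modules) are assumptions for illustration, not the repository's exact code:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One block of the pipeline above. The five stages stand in for the
    repository's RWKV_TimeMix, RWKV_ChannelMix, Sliding Window Attention,
    GroupedQAttention, and TinyMoE; nn.Identity keeps the sketch runnable."""
    def __init__(self, n_embd):
        super().__init__()
        # swap each nn.Identity() for the real module from the repo
        self.stages = nn.ModuleList(nn.Identity() for _ in range(5))
        self.norms = nn.ModuleList(nn.LayerNorm(n_embd) for _ in range(5))

    def forward(self, x):                      # x: (B, T, n_embd)
        for norm, stage in zip(self.norms, self.stages):
            x = x + stage(norm(x))             # pre-norm residual at each stage
        return x

x = torch.randn(2, 16, 768)
print(Block(768)(x).shape)                     # torch.Size([2, 16, 768])
```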

Detailed Explanation

  1. RWKV_TimeMix: This module applies a time-based mixing operation to the input, helping the model capture temporal dependencies. Learnable parameters such as time_maa_k, time_maa_v, time_maa_r, and time_maa_g control how much of the previous token is blended into the current one, and a time_decay parameter implements a per-channel decay that weights recent inputs more heavily. The mixed inputs are passed through receptance, key, value, and gate linear layers (sketched after this list).

  2. RWKV_ChannelMix: This module performs a channel-based mixing operation, letting the model learn richer representations across feature channels. It uses a time-shift operation with learnable parameters time_maa_k and time_maa_r to control the mixing, then applies key, value, and receptance linear layers to the mixed input; the receptance output passes through a sigmoid and gates the value path (sketched after this list).

  3. Sliding Window Attention: This attention mechanism restricts each token to a fixed window of past tokens, capturing local dependencies efficiently; global context accumulates as layers stack. The module computes the query, key, and value matrices with a linear layer, applies the windowed causal attention, and projects the result through a final linear layer (sketched after this list).

  4. GroupedQAttention: This module computes the query, key, value, and weight matrices with a single linear layer, splits them into groups, performs the attention computation within each group, and concatenates the results before a final linear projection. Sharing key/value heads across groups of query heads is what makes grouped-query attention cheaper than full multi-head attention (a generic sketch follows this list).

  5. TinyMoE: The Tiny Mixture of Experts (TinyMoE) layer is a lightweight, efficient Mixture of Experts (MoE) mechanism that helps the model learn specialized representations. A linear layer computes routing scores that weight a set of expert networks to produce the output, and an auxiliary loss term encourages the experts to learn diverse representations (sketched after this list).
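A minimal sketch of the time mixing in item 1, using the parameter names from the description (time_maa_*, time_decay). The exponential-decay loop is a simplified stand-in for the full RWKV WKV recurrence, and all shapes and initializations are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeMixSketch(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        # shift the sequence right by one token so each position sees its predecessor
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))
        # per-channel interpolation weights between current and previous token
        self.time_maa_k = nn.Parameter(torch.zeros(1, 1, n_embd))
        self.time_maa_v = nn.Parameter(torch.zeros(1, 1, n_embd))
        self.time_maa_r = nn.Parameter(torch.zeros(1, 1, n_embd))
        self.time_maa_g = nn.Parameter(torch.zeros(1, 1, n_embd))
        self.time_decay = nn.Parameter(torch.zeros(n_embd))  # per-channel decay rate
        self.receptance = nn.Linear(n_embd, n_embd, bias=False)
        self.key = nn.Linear(n_embd, n_embd, bias=False)
        self.value = nn.Linear(n_embd, n_embd, bias=False)
        self.gate = nn.Linear(n_embd, n_embd, bias=False)
        self.output = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x):                        # x: (B, T, C)
        xx = self.time_shift(x)                  # previous token at each position
        xk = x + (xx - x) * self.time_maa_k      # blend current and previous token
        xv = x + (xx - x) * self.time_maa_v
        xr = x + (xx - x) * self.time_maa_r
        xg = x + (xx - x) * self.time_maa_g
        r = torch.sigmoid(self.receptance(xr))
        k, v = self.key(xk), self.value(xv)
        g = F.silu(self.gate(xg))
        # exponentially decayed running sum over time (a simplified WKV state)
        w = torch.exp(-torch.exp(self.time_decay))   # decay factor in (0, 1)
        state = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.size(1)):
            state = state * w + k[:, t] * v[:, t]
            outs.append(r[:, t] * state)
        return self.output(torch.stack(outs, dim=1) * g)
```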
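Item 2's channel mixing under the same assumptions; the squared-ReLU key activation is RWKV's usual choice, and the sigmoid receptance gates the value path:

```python
import torch
import torch.nn as nn

class ChannelMixSketch(nn.Module):
    def __init__(self, n_embd, hidden_mult=4):
        super().__init__()
        self.time_shift = nn.ZeroPad2d((0, 0, 1, -1))   # previous-token shift
        self.time_maa_k = nn.Parameter(torch.zeros(1, 1, n_embd))
        self.time_maa_r = nn.Parameter(torch.zeros(1, 1, n_embd))
        self.key = nn.Linear(n_embd, hidden_mult * n_embd, bias=False)
        self.value = nn.Linear(hidden_mult * n_embd, n_embd, bias=False)
        self.receptance = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x):                               # x: (B, T, C)
        xx = self.time_shift(x)
        xk = x + (xx - x) * self.time_maa_k             # mix for the key path
        xr = x + (xx - x) * self.time_maa_r             # mix for the receptance path
        k = torch.relu(self.key(xk)) ** 2               # squared-ReLU feed-forward
        return torch.sigmoid(self.receptance(xr)) * self.value(k)
```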
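A sketch of the sliding-window attention in item 3. The window_size hyperparameter and the dense banded mask are assumptions; a production kernel would avoid materializing the full T×T attention matrix:

```python
import math
import torch
import torch.nn as nn

class SlidingWindowAttentionSketch(nn.Module):
    def __init__(self, n_embd, n_head, window_size=128):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head, self.window = n_head, window_size
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x):                       # x: (B, T, C)
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)   # one linear layer for q, k, v
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
                   for t in (q, k, v))          # (B, n_head, T, head_dim)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(C // self.n_head)
        # causal mask, additionally banded: each token attends only to the
        # previous `window` tokens
        i = torch.arange(T, device=x.device)
        mask = (i[None, :] <= i[:, None]) & (i[:, None] - i[None, :] < self.window)
        att = att.masked_fill(~mask, float('-inf'))
        y = torch.softmax(att, dim=-1) @ v
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))
```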
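The repository's description of item 4 mentions a single projection producing query, key, value, and weight matrices; the sketch below instead shows textbook grouped-query attention, where a smaller set of key/value heads is shared across groups of query heads, to illustrate the general technique. The n_kv_head=2 grouping is an assumption, and scaled_dot_product_attention requires PyTorch 2.0+:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQAttentionSketch(nn.Module):
    def __init__(self, n_embd, n_head=8, n_kv_head=2):
        super().__init__()
        assert n_head % n_kv_head == 0 and n_embd % n_head == 0
        self.n_head, self.n_kv_head = n_head, n_kv_head
        self.head_dim = n_embd // n_head
        self.q = nn.Linear(n_embd, n_embd, bias=False)
        # fewer key/value heads than query heads: this is where the saving comes from
        self.kv = nn.Linear(n_embd, 2 * n_kv_head * self.head_dim, bias=False)
        self.proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x):                       # x: (B, T, C)
        B, T, C = x.shape
        q = self.q(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k, v = self.kv(x).split(self.n_kv_head * self.head_dim, dim=2)
        k = k.view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
        # repeat each key/value head so every group of query heads has a match
        rep = self.n_head // self.n_kv_head
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))
```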
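A sketch of item 5. The router, the dense expert evaluation, and the squared-mean auxiliary loss are simple illustrative choices, not necessarily the repository's TinyMoE; the defaults mirror the table below (num_experts=4, num_active=4, expert_dim=512):

```python
import torch
import torch.nn as nn

class TinyMoESketch(nn.Module):
    def __init__(self, n_embd=768, num_experts=4, num_active=4, expert_dim=512):
        super().__init__()
        self.num_active = num_active
        self.router = nn.Linear(n_embd, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(n_embd, expert_dim), nn.GELU(),
                          nn.Linear(expert_dim, n_embd))
            for _ in range(num_experts))

    def forward(self, x):                                 # x: (B, T, C)
        scores = torch.softmax(self.router(x), dim=-1)    # (B, T, num_experts)
        topv, topi = scores.topk(self.num_active, dim=-1) # keep the top-k experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # routing weight for expert e at each token (0 if not selected)
            w = torch.where(topi == e, topv, torch.zeros_like(topv)).sum(-1)
            out = out + w.unsqueeze(-1) * expert(x)
        # simple load-balancing auxiliary loss: penalizes uneven expert usage
        aux_loss = (scores.mean(dim=(0, 1)) ** 2).sum() * scores.size(-1)
        return out, aux_loss
```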

Usage (Inference)

To use this model for inference, you can follow these steps:

  1. Download the model weights and place them in the out directory.
  2. Copy the configuration values (block_size, vocab_size, etc.) from the table below into the GPTConfig class in generate.py (see the example after these steps).
  3. Then run the following command:
python generate.py --prompt="One day" --max_num_tokens=50 --model_name="ckpt-500"

Explanation: This command generates text from the prompt "One day" using the model weights stored in the out directory. The max_num_tokens parameter sets the maximum number of tokens to generate, and model_name names the weights file to load, without the extension: "ckpt-500", "ckpt-1000", or simply "ckpt".
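For example, with the ckpt-500 checkpoint from the table below, the configuration might look like this (the field names follow the nanoGPT convention and are assumptions; check the actual GPTConfig class in generate.py):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # values from the ckpt-500 row of the table below; field names follow
    # the nanoGPT convention and may differ from the class in generate.py
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 8
    n_head: int = 8
    n_embd: int = 768
    num_experts: int = 4
    num_active_experts: int = 4
    expert_dim: int = 512
    dropout: float = 0.0
    bias: bool = False
```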

Usage (Training)

TO BE UPDATED

Tables

| name_model | BLOCK_SIZE | VOCAB_SIZE | N_LAYER | N_HEAD | N_EMBD | NUM_EXPERTS | NUM_ACTIVE_EXPERTS | EXPERT_DIM | DIM | DROPOUT | BIAS | DATASET | DOWNLOAD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ckpt-500.pth | 1024 | 50304 | 8 | 8 | 768 | 4 | 4 | 512 | 768 | 0.0 | False | tinystories_15k | https://drive.google.com/drive/folders/15WWIE4MJKXlazmonBg3UqnlU9ew-yv6Z?usp=sharing |

Results

Prompt: One day

Generated text: One day: Sharing positive bought Isabel a rainbow hug. Her name was an vitamins, so only one favorite thing to cheer she were.

Lily picked up a hay and proudly went to a small portion. She was very happened. When Tommy said it

Generated text length: 227 | Inference time: 3 seconds

Dependencies

  • torch
  • numpy
  • tiktoken

Conclusion

The nanoRWKV model is a custom neural network architecture that combines several cutting-edge techniques, such as time-based and channel-based mixing, sliding window attention, grouped attention, and a Tiny Mixture of Experts (TinyMoE) layer. These components work together to enhance the model's ability to capture both local and global dependencies, as well as to learn specialized representations. The combination of these techniques results in a powerful and efficient model that can be used for a variety of natural language processing tasks.

Contributor

You can contribute to this project by forking it, building your own custom model for text generation, and sharing it with the community. The community will be grateful for your contribution.
