Awesome-state-space-models

Collection of papers/repos on state-space models.

(Potential) SOTA

Main idea: input-dependent gating.

Mamba (https://arxiv.org/abs/2312.00752) GitHub

$$g_k = \sigma(\mathrm{Linear}(x_k)),$$ $$h_{k+1} = (1-g_k) h_{k} + g_k x_k.$$

The activation is SiLU / Swish. Since $h_{k+1} - h_k = g_k (x_k - h_k)$, the continuous form is $$\frac{dh_t}{dt} = g_t (x_t - h_t).$$
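The gated recurrence above can be sketched in a few lines of plain Python. This is a scalar toy version, not Mamba's actual layer: the weight `w` and bias `b` are hypothetical stand-ins for the `Linear` map, and the function name `gated_scan` is made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_scan(xs, w, b, h0=0.0):
    """Run h_{k+1} = (1 - g_k) h_k + g_k x_k with g_k = sigmoid(w * x_k + b).

    w and b are assumed scalar parameters standing in for Linear(x_k).
    """
    h = h0
    states = []
    for x in xs:
        g = sigmoid(w * x + b)      # input-dependent gate, always in (0, 1)
        h = (1.0 - g) * h + g * x   # convex combination of old state and input
        states.append(h)
    return states


xs = [1.0, 1.0, 1.0, -2.0, 0.5]
states = gated_scan(xs, w=2.0, b=0.0)
```

Because each update is a convex combination, the state always stays between the smallest and largest values seen so far, and a constant input pulls the state toward that input.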

Various (unofficial) implementations:
Gated Linear Attention (GLA) (https://arxiv.org/abs/2312.06635) GitHub

GitHub-flash-linear-attention

On the replacement of transformer/attention by SSMs

[Language model] Pretraining Without Attention (https://arxiv.org/abs/2212.10544) GitHub

Feature: Bidirectional Language Modeling with State-space Model
[RL] Structured State Space Models for In-Context Reinforcement Learning (https://arxiv.org/abs/2303.03982) GitHub
[Diffusion] Diffusion Models Without Attention (https://arxiv.org/abs/2311.18257) (NeurIPS 2023 Workshop on Diffusion Models)
[Graph] Recurrent Distance-Encoding Neural Networks for Graph Representation Learning (https://arxiv.org/abs/2312.01538) GitHub
[MoE] MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts (https://arxiv.org/abs/2401.04081) GitHub
[Bio] U-Mamba, a versatile network designed specifically for biomedical image segmentation. (https://arxiv.org/abs/2401.04722) GitHub
[Vision] VMamba: Visual State Space Model. (https://arxiv.org/abs/2401.10166) GitHub
[Tabular data] MambaTab: A Simple Yet Effective Approach for Handling Tabular Data (https://arxiv.org/abs/2401.08867)
[MoE] BlackMamba: Mixture of Experts for State-Space Models (https://www.zyphra.com/blackmamba) GitHub
[RWKV-TS] RWKV-TS: Beyond Traditional Recurrent Neural Network for Time Series Tasks (https://arxiv.org/abs/2401.09093) GitHub
[Vision] Vision Mamba (Vim) is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. (https://arxiv.org/abs/2401.09417) GitHub
[Vision] SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation. (https://arxiv.org/abs/2401.13560) GitHub
[Token-free language models] MambaByte: Token-free Selective State Space Model. (https://arxiv.org/abs/2401.13660) [GitHub](https://github.com/kyegomez/MambaByte)

Token-free language models learn directly from raw bytes and remove the bias of subword tokenization.
[Vision] MambaMorph: a Mamba-based Backbone with Contrastive Feature Learning for Deformable MR-CT Registration. (https://arxiv.org/abs/2401.13934) GitHub
[Video] Vivim: a Video Vision Mamba for Medical Video Object Segmentation (https://arxiv.org/pdf/2401.14168.pdf) GitHub
LOCOST: State-Space Models for Long Document Abstractive Summarization (https://arxiv.org/abs/2401.17919) GitHub
[Graph] Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces (https://arxiv.org/abs/2402.00789) GitHub
Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining (https://arxiv.org/abs/2402.03302) GitHub
[Bio] nnMamba: 3D Biomedical Image Segmentation, Classification and Landmark Detection with State Space Model (https://arxiv.org/abs/2402.03526) GitHub
Is Mamba Capable of In-Context Learning? (https://arxiv.org/abs/2402.03170)
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks (https://arxiv.org/abs/2402.04248)
[Graph] Graph Mamba: Towards Learning on Graphs with State Space Models (https://arxiv.org/abs/2402.08678) GitHub
Spectral State Space Models (https://arxiv.org/abs/2312.06837v3) GitHub

ICLR 2024 submissions

I summarize each paper with the 2-3 most important sentences from its abstract. (https://openreview.net/group?id=ICLR.cc/2024/Conference)

FlashFFTConv (https://openreview.net/forum?id=gPKTTAfYBp)

FlashFFTConv speeds up exact FFT convolutions by up to 8.7× over PyTorch and achieves up to 4.4× speedup end-to-end. GitHub.
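For readers new to the operation being accelerated, here is a minimal sketch of an exact FFT convolution in plain Python. It does not reproduce any of FlashFFTConv's optimizations. The function names are illustrative, and it assumes the kernel is no longer than the input.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1.0 if invert else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_conv(u, kernel):
    """Causal convolution y[t] = sum_{s<=t} kernel[s] * u[t-s] via zero-padded FFT.

    Assumes len(kernel) <= len(u); padding to >= 2 * len(u) avoids wrap-around.
    """
    length = len(u)
    n = 1
    while n < 2 * length:
        n *= 2
    fu = fft([complex(v) for v in u] + [0j] * (n - length))
    fk = fft([complex(v) for v in kernel] + [0j] * (n - len(kernel)))
    y = fft([x * z for x, z in zip(fu, fk)], invert=True)
    return [(v / n).real for v in y[:length]]  # divide by n to normalize the inverse


def direct_conv(u, kernel):
    """Reference O(L^2) causal convolution for comparison."""
    return [
        sum(kernel[s] * u[t - s] for s in range(min(t + 1, len(kernel))))
        for t in range(len(u))
    ]


u = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.5, 0.25]
y_fft = fft_conv(u, kernel)
y_ref = direct_conv(u, kernel)
```

The two results agree up to floating-point error; the FFT route costs O(L log L) instead of O(L^2) in the sequence length L.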
Variational quantization for state space models (https://openreview.net/forum?id=EAkjVCtRO2)

In this work, we propose a new forecasting model that combines discrete state space hidden Markov models with recent neural network architectures and training procedures inspired by vector quantized variational autoencoders. We introduce a variational discrete posterior distribution of the latent states given the observations and a two-stage training procedure to alternatively train the parameters of the latent states and of the emission distributions.
Efficient Long Sequence Modeling via State Space Augmented Transformer (https://openreview.net/forum?id=xuxYaBMd9F)

We propose SPADE, short for State Space Augmented Transformer. Specifically, we augment a SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers.

SSM + Transformer GitHub
StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization (https://openreview.net/forum?id=BwG8hwohU4)

Our analysis identifies this "curse of memory" as a result of the recurrent weights converging to a stability boundary, suggesting that a reparameterization technique can be effective. To this end, we introduce a class of reparameterization techniques for SSMs that effectively lift its memory limitations. Besides improving approximation capabilities, we further illustrate that a principled choice of reparameterization scheme can also enhance optimization stability.

Stability, more on parameterisation
Robustifying State-space Models for Long Sequences via Approximate Diagonalization (https://openreview.net/forum?id=DjeQ39QoLQ)

We introduce a generic, backward-stable "perturb-then-diagonalize" (PTD) methodology, which is based on the pseudospectral theory of non-normal operators, and which may be interpreted as the approximate diagonalization of the non-normal matrices defining SSMs. Based on this, we introduce the S4-PTD and S5-PTD models. Through theoretical analysis of the transfer functions of different initialization schemes, we demonstrate that the S4-PTD/S5-PTD initialization strongly converges to the HiPPO framework, while the S4D/S5 initialization only achieves weak convergences.

Robustness, more on initialization
From generalization analysis to optimization designs for state space models (https://openreview.net/forum?id=EGjvMcKrrl)

In this paper, we theoretically study the generalization of SSMs and propose improvements to training algorithms based on the generalization results. Specifically, we give a data-dependent generalization bound for SSMs, showing an interplay between the SSM parameters and the temporal dependencies of the training sequences. Leveraging the generalization bound, we (1) set up a scaling rule for model initialization based on the proposed generalization measure, which significantly improves the robustness of SSMs to different temporal patterns in the sequence data; (2) introduce a new regularization method for training SSMs to enhance the generalization performance. Numerical results are conducted to validate our results.
A 2-Dimensional State Space Layer for Spatial Inductive Bias (https://openreview.net/forum?id=BGkqypmGvm)

We leverage an expressive variation of the multidimensional State Space Model (SSM). Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme. Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT) significantly enhances performance for multiple ViT backbones and across datasets. The new layer is effective even with a negligible amount of additional parameters and inference time.

Vision task
Hieros: Hierarchical Imagination on Structured State Space Sequence World Models (https://openreview.net/forum?id=5j6wtOO6Fk)

We propose HIEROS, a hierarchical policy that learns time abstracted world representations and imagines trajectories at multiple time scales in latent space. HIEROS uses an S5 layer-based world model, which predicts next world states in parallel during training and iteratively during environment interaction. Due to the special properties of S5 layers, our method can train in parallel and predict next world states iteratively during imagination. This allows for more efficient training than RNN-based world models and more efficient imagination than Transformer-based world models.

Reinforcement Learning (Use SSM instead of Transformer)
S4++: Elevating Long Sequence Modeling with State Memory Reply (https://openreview.net/forum?id=bdnw4qjfH9)

The abstract lists two limitations of SSMs:
1. Non-Stable-States (NSS): Significant state variance discrepancies arise among discrete sampling steps, occasionally resulting in divergence.
2. Dependency Bias: The unidirectional state space dependency in SSM impedes the effective modeling of intricate dependencies.

In this paper, we conduct theoretical analysis of SSM from the event-triggered control (ETC) theory perspective and first propose the presence of NSS Phenomenon. Our findings indicate that NSS primarily results from the sampling steps, and the integration of multi-state inputs into the current state significantly contributes to the mitigation of NSS. Building upon these theoretical analyses and findings, we propose a simple, yet effective, theoretically grounded State Memory Reply (SMR) mechanism that leverages learnable memories to incorporate multi-state information into the current state.

Stability
Mamba: Linear-Time Sequence Modeling with Selective State Spaces (https://openreview.net/forum?id=AL1fq05o7H)

Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).

Time-dependent or input-dependent state-space models + Hardware acceleration

A very nice analysis in Chinese: https://zhuanlan.zhihu.com/p/661237120.
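To see why a recurrence with input-dependent coefficients can still be parallelized, the sketch below rewrites the scalar recurrence $h_k = a_k h_{k-1} + b_k$ as a scan over an associative operator. This is only a toy illustration and not Mamba's hardware-aware kernel: `a` and `b` are assumed to be already-computed per-step coefficients, the initial state is taken to be zero, and all function names are made up.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan_sequential(a, b, h0=0.0):
    """Reference recurrent mode: h_k = a_k * h_{k-1} + b_k."""
    h, out = h0, []
    for ak, bk in zip(a, b):
        h = ak * h + bk
        out.append(h)
    return out


def scan_parallel_style(a, b):
    """Hillis-Steele inclusive scan with `combine`.

    Each pass of the while loop touches every position independently, so it
    could run in parallel; there are about log2(len(a)) passes in total.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    # With h0 = 0, the state equals the offset of the composed affine map.
    return [bk for _, bk in elems]


a = [0.9, 0.5, 0.1, 0.7, 0.3]   # input-dependent decay coefficients (assumed given)
b = [1.0, 2.0, 3.0, 4.0, 5.0]   # input-dependent drive terms (assumed given)
h_seq = scan_sequential(a, b)
h_par = scan_parallel_style(a, b)
```

Both functions return the same states. The operator is associative but not commutative, so the order of the arguments to `combine` matters.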
Gated recurrent neural networks discover attention (https://openreview.net/forum?id=rfSfDSFrRL)

These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers.

By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers.

Naive question: how does the contribution differ from that of "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention"?

Universality of SSM + Optimization verification over ICL
GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling (https://openreview.net/forum?id=02Ug9N8DCI)

We develop GateLoop, a foundational sequence model that generalizes linear recurrent models such as S4, S5, LRU and RetNet, by employing data-controlled state transitions. Furthermore, we derive an $O(l^2)$ surrogate-attention mode, revealing remarkable implications for Transformer and recently proposed architectures. While many existing models solely rely on data-controlled cumulative sums for context aggregation, our findings suggest that incorporating data-controlled complex cumulative products may be a crucial step towards more powerful sequence models.

Data-controlled state transitions sound similar to 9. TODO: comparison.

Official GitHub, Unofficial GitHub
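The equivalence between a recurrent mode and an $O(l^2)$ attention-like mode can be checked on a toy example. The sketch below is a simplified real-valued scalar version, not GateLoop's actual formulation (which the abstract describes in terms of complex cumulative products): the names `q`, `k`, `v`, `a` and both function names are illustrative assumptions.

```python
def recurrent_mode(q, k, v, a):
    """h_t = a_t * h_{t-1} + k_t * v_t,  y_t = q_t * h_t  (linear time)."""
    h, ys = 0.0, []
    for qt, kt, vt, at in zip(q, k, v, a):
        h = at * h + kt * vt
        ys.append(qt * h)
    return ys


def surrogate_attention_mode(q, k, v, a):
    """y_t = sum_{s<=t} q_t * (prod_{j=s+1..t} a_j) * k_s * v_s  (quadratic time)."""
    ys = []
    for t in range(len(q)):
        total, decay = 0.0, 1.0
        for s in range(t, -1, -1):
            # decay holds the product of a[s+1..t]; it is 1 for s == t
            total += q[t] * decay * k[s] * v[s]
            decay *= a[s]
        ys.append(total)
    return ys


q = [0.5, -1.0, 2.0, 0.3]
k = [1.0, 0.2, -0.7, 1.5]
v = [2.0, 1.0, 0.5, -1.0]
a = [0.9, 0.8, 0.1, 0.6]   # data-controlled state transitions (assumed given)
y_rec = recurrent_mode(q, k, v, a)
y_att = surrogate_attention_mode(q, k, v, a)
```

The cumulative product of the transitions plays the role of a data-dependent, causal attention weight between positions `s` and `t`.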
Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors (https://openreview.net/forum?id=PdaPky8MUn)

In this work, we show that random initialization leads to gross overestimation of the differences between architectures and that pretraining with standard denoising objectives, using only the downstream task data, leads to dramatic gains across multiple architectures and to very small gaps between Transformers and state space models (SSMs). In stark contrast to prior works, we find vanilla Transformers to match the performance of S4 on Long Range Arena when properly pretrained, and we improve the best reported results of SSMs on the PathX-256 task by 20 absolute points. Subsequently, we analyze the utility of previously-proposed structured parameterizations for SSMs and show they become mostly redundant in the presence of data-driven initialization obtained through pretraining. Our work shows that, when evaluating different architectures on supervised tasks, incorporation of data-driven priors via pretraining is essential for reliable performance estimation, and can be done efficiently.

GitHub
Mastering Memory Tasks with World Models (https://openreview.net/forum?id=1vDArHJ68h)

To improve temporal coherence, we integrate a new family of state space models (SSMs) in world models of MBRL agents to present a new method, Recall to Imagine (R2I). This integration aims to enhance both long-term memory and long-horizon credit assignment. Through a diverse set of illustrative tasks, we systematically demonstrate that R2I establishes a new state-of-the-art performance in challenging memory and credit assignment RL tasks, such as Memory Maze, BSuite, and POPGym. We also show that R2I is faster than the state-of-the-art MBRL method, DreamerV3, resulting in faster wall-time convergence.

Reinforcement Learning GitHub

Arxiv

RWKV (https://arxiv.org/abs/2305.13048): GitHub
RetNet (https://arxiv.org/abs/2307.08621) GitHub
Zoology (https://arxiv.org/abs/2312.04927) GitHub
Structured state-space models are deep Wiener models (https://arxiv.org/abs/2312.06211)

NeurIPS 2023

State-space Models with Layer-wise Nonlinearity are Universal Approximators with Exponential Decaying Memory (https://arxiv.org/abs/2309.13414)

The authors show that layer-wise nonlinearity is enough to achieve universality when the state-space model has multiple layers.

It is also shown that, similar to traditional nonlinear recurrent neural networks, SSMs suffer from asymptotically exponential memory decay.
Sparse Modular Activation for Efficient Sequence Modeling (SMA) (https://arxiv.org/abs/2306.11197) GitHub

SSM + Attention, SOTA at LRA.

We design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM.
Laughing Hyena Distillery: Extracting Compact Recurrences from Convolutions (https://arxiv.org/abs/2310.18780)

Given a convolution-based Hyena model, the authors want to extract recurrent weights for the convolution kernel so that the convolutional model can be converted into a recurrent model. The method is based on the SVD of a Hankel matrix.
Structured State Space Models for In-Context Reinforcement Learning (https://arxiv.org/abs/2303.03982) GitHub

We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel, allowing us to tackle reinforcement learning tasks. We show that our modified architecture runs asymptotically faster than Transformers in sequence length and performs better than RNNs on a simple memory-based task.
Convolutional State Space Models for Long-Range Spatiotemporal Modeling (https://arxiv.org/abs/2310.19694) GitHub
Hierarchically Gated Recurrent Neural Network for Sequence Modeling (https://paperswithcode.com/paper/hierarchically-gated-recurrent-neural-network) GitHub

ICML 2023

Resurrecting Recurrent Neural Networks for Long Sequences (https://icml.cc/virtual/2023/oral/25438)
Hyena Hierarchy: Towards Larger Convolutional Language Models (https://arxiv.org/abs/2302.10866) GitHub
Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series (https://icml.cc/virtual/2023/oral/25554) GitHub

Before 2023

See the GitHub repo State-spaces for S4, including HiPPO, LSSL, SaShiMi, DSS, HTTYH, S4D, and S4ND.

GSS
Simplified State Space Layers for Sequence Modeling (S5) (https://openreview.net/forum?id=Ai8Hw3AXqks) GitHub
[Parallel scan] Parallelizing Linear Recurrent Neural Nets Over Sequence Length (https://openreview.net/forum?id=HyUNwulC-)
Bayesian state-space models GitHub.

Another very good note is: https://personal.strath.ac.uk/gary.koop/GSE_Bayesian/Bayesian_State_Space_Methods.pdf
Mega: Moving Average Equipped Gated Attention (Mega) GitHub
Annotated S4 By Sasha Rush and Sidd Karamcheti GitHub

TODO

Summarize the important unsolved questions in state-space models. (Personal viewpoint)

Scale-up: how to train a larger state-space model with better performance. Interesting topics include, but are not limited to, scaling laws. One can scale up the depth ($l$), width ($m$), sequence length ($T$), or hidden state size ($S$). How should the hidden states and the parameters be scaled up at the same time? Mamba's hidden states are 1D ($S = l \cdot 16m$) while linear attention's hidden states are 2D ($S = l \cdot m^2$). Intuitively, the parameter count should grow faster than the hidden state size, because the recurrent model approximates the map $(x_k, h_k) \mapsto (y_k, h_{k+1}) \in \mathbb{R}^{m + S}$.
Speed-up: how to make the SSM layer faster. This topic can borrow many ideas from FlashAttention, as has been done in FlashFFTConv, Mamba, and gated linear attention. Another viewpoint comes from classical control theory: use ideas from system identification (Hankel matrix decomposition), as in Laughing Hyena.
Cheaper: given a large model, how to preserve the model performance while running inference with fewer FLOPs.
1. Quantization belongs to this part. However, using lower precision might cause training instability.
2. Maybe we can consider a minimal realization of the state-space model (https://ocw.mit.edu/courses/6-241j-dynamic-systems-and-control-spring-2011/resources/mit6_241js11_lec21/): trim the hidden dimension from a large dimension down to a few principal components.
Theoretical guarantees
1. Rates for approximation/generalization/optimization.
2. Stability in approximation/optimization: Does SSM resolve the difficulty in traditional nonlinear RNNs such as GRU and LSTM?
3. Initialization scheme, for example, task-dependent initialization.
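As a worked example of the hidden-state sizes quoted in the scale-up item ($S = l \cdot 16m$ for Mamba versus $S = l \cdot m^2$ for linear attention), the snippet below plugs in a hypothetical configuration. The values `l = 24` and `m = 1024` are chosen only for illustration and do not correspond to any particular released model.

```python
def mamba_state_size(depth, width, state_dim=16):
    """S = l * 16 * m: each of the m channels carries a state of size 16."""
    return depth * state_dim * width


def linear_attention_state_size(depth, width):
    """S = l * m^2: each layer carries an m-by-m state."""
    return depth * width * width


l, m = 24, 1024  # hypothetical depth and width
s_mamba = mamba_state_size(l, m)
s_linattn = linear_attention_state_size(l, m)
ratio = s_linattn // s_mamba  # equals m / 16 under these formulas
```

Under these formulas the ratio of the two state sizes is $m / 16$, so the gap widens linearly with the width.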
