
This artificial intelligence (AI) research explores the expressiveness gap between state space models and the attention mechanisms of transformer language models.


State space models (SSMs) represent dynamic systems using state variables. These models work primarily with time series data and use a collection of first-order differential equations to describe a system. Thanks to recent advances, SSMs have achieved remarkable performance in areas such as financial research and time series forecasting. However, one area where they fall short of expectations is language modeling, where they cannot match the performance of transformers. SSMs are also slower than transformers in practice, despite scaling approximately linearly rather than quadratically with sequence length. Researchers believe the main cause of this is underutilization of hardware.
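To make the state-variable idea concrete, a discrete-time linear SSM updates a hidden state from each new input and reads an output from that state. Below is a minimal NumPy sketch of this recurrence; the function names and matrix shapes are illustrative, not taken from any particular SSM library.

```python
import numpy as np

def ssm_step(A, B, C, x_prev, u_t):
    """One discrete SSM update: x_t = A x_{t-1} + B u_t, y_t = C x_t."""
    x_t = A @ x_prev + B * u_t   # state evolves linearly with each input
    y_t = C @ x_t                # output is a linear readout of the state
    return x_t, y_t

def run_ssm(A, B, C, inputs):
    """Scan a 1-D input sequence through the SSM, collecting scalar outputs."""
    x = np.zeros(A.shape[0])
    outputs = []
    for u in inputs:
        x, y = ssm_step(A, B, C, x, u)
        outputs.append(float(y))
    return outputs
```

Because each step only touches the current state, cost grows linearly with sequence length, which is the scaling advantage over quadratic attention mentioned above.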

Researchers from Stanford University collaborated with the State University of New York at Buffalo in an effort to understand and close this gap between SSMs and the attention mechanisms of transformer language models. The team documented the results of their investigation in the recent paper, ‘Hungry Hungry Hippos: Towards Language Modeling with State Space Models.’ They also investigated various methods to reduce the hardware barrier between SSMs and attention and proposed a new state-passing algorithm called FlashConv, which achieves a 2x speedup on the Long Range Arena benchmark and enables text generation 1.6x faster than conventional transformer architectures.

The two main capabilities that SSMs struggle with are remembering previously encountered tokens and comparing tokens across a sequence. The team used synthetic language modeling tasks focused on text manipulation to identify these expressivity gaps between SSMs and attention. Their key contribution is a new SSM layer called Hungry Hungry Hippos (H3), proposed as an alternative to attention in language modeling. The H3 layer stacks two discrete SSMs with multiplicative interactions between their input projections and corresponding outputs to simulate comparisons between different points in a sequence. H3 matches attention on the synthetic languages and is competitive with Transformers in perplexity on OpenWebText. Additionally, on the OpenWebText benchmark, a hybrid H3-attention model outperforms Transformers by 1.0 PPL (perplexity).
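The stacking-with-multiplicative-gating idea can be sketched in a few lines of NumPy. This is a simplified toy, not the authors' implementation: one SSM acts as a one-step shift (to recall the previous token) and the other as an exponential moving memory (to accumulate comparisons), with element-wise products standing in for the multiplicative interactions. All function names and the decay constant are illustrative assumptions.

```python
import numpy as np

def shift_ssm(x):
    """Shift SSM: delays the sequence by one step, recalling the previous token."""
    out = np.zeros_like(x)
    out[1:] = x[:-1]
    return out

def diag_ssm(x, a=0.9):
    """Diagonal SSM as an exponential moving memory over the sequence."""
    out = np.zeros_like(x)
    state = np.zeros(x.shape[-1])
    for t in range(len(x)):
        state = a * state + x[t]   # decaying accumulation of past inputs
        out[t] = state
    return out

def h3_layer(u, Wq, Wk, Wv):
    """Toy H3-style layer: two stacked SSMs with multiplicative gating."""
    q, k, v = u @ Wq, u @ Wk, u @ Wv
    kv = shift_ssm(k) * v          # first interaction: shifted keys gate values
    return q * diag_ssm(kv)        # second: queries gate the accumulated memory
```

The two gated products are what let the layer imitate attention-style token comparisons without ever forming a quadratic attention matrix.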


The researchers also proposed FlashConv, a hierarchical, hardware-aware algorithm for SSMs. It was designed to let SSMs exploit contemporary accelerators and run faster than attention. FlashConv uses the Fast Fourier Transform (FFT) to compute SSM convolutions efficiently over long text sequences. By taking advantage of the recurrent properties of SSMs, inputs can be split into smaller chunks that fit in GPU SRAM, with the SSM state passed between chunks for efficient computation. As a result, FlashConv can scale SSMs to any sequence length with nearly linear computational complexity.
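The FFT trick at the heart of this is standard: a long causal convolution that would cost O(N²) directly can be computed in O(N log N) by multiplying in the frequency domain. Below is a minimal NumPy sketch of that step alone; it omits FlashConv's kernel fusion and SRAM-aware chunking, and the function name is illustrative.

```python
import numpy as np

def fft_conv(u, kernel):
    """Causal convolution of input u with an SSM kernel via FFT, O(N log N)."""
    n = len(u)
    fft_len = 2 * n                       # zero-pad to avoid circular wrap-around
    U = np.fft.rfft(u, n=fft_len)
    K = np.fft.rfft(kernel, n=fft_len)
    # pointwise multiply in frequency domain, invert, keep the causal prefix
    return np.fft.irfft(U * K, n=fft_len)[:n]
```

FlashConv's contribution is making this computation hardware-efficient: fusing the FFT with surrounding operations, and using the recurrent view of the SSM to carry state across chunks so each chunk's working set fits in on-chip SRAM.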

After several experimental evaluations, the team concluded that FlashConv set a new speed record on the Long Range Arena benchmark with a 2x speedup. Using FlashConv, the team also scaled hybrid H3-attention language models up to 1.3 billion parameters. These models excelled on most SuperGLUE benchmark tasks in zero- and few-shot learning. Given the performance gains over the pure H3 model and over Transformers obtained simply by combining two attention layers with H3, the researchers concluded that scaling SSMs to larger sizes is a promising direction. They look forward to continuing to combine the complementary qualities of SSMs and attention in future work, and are eager to investigate more sophisticated hybrid designs.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.

Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about machine learning, natural language processing, and web development. She likes to learn more about the technical field by participating in several challenges.


