LONGNET: Scaling Transformers to 1,000,000,000 Tokens

A Leap Forward in AI’s Ability to Process Long Sequences

Introduction

Just a few days ago, I penned down my thoughts on the rapid pace of evolution in the AI landscape. I mused about how a blog written on a Sunday might need a refresh by Monday; such is the speed at which this field advances. But I admit I didn’t see this coming: 1 billion tokens.

This led to a lively debate with some of my colleagues and friends about the velocity of this technological disruption. I firmly believed that the growth would be exponential, while others argued for a more gradual progression.

I’m not referring to the speed at which AI will become an integral part of our daily lives. I can’t fully grasp the extent of the disruptions we’re about to experience. What I can assert, however, is that this blog wouldn’t exist without AI (a tip of the hat to the elephant in the room, the AI that powers this blog) and that AI has already started to influence my everyday life significantly.

It’s not exponential … it’s way faster … 1 billion tokens

However, I underestimated the rate of this exponential acceleration. When the original GPT was launched in 2018, it could handle 512 tokens. GPT-2 raised that to 1,024 in 2019, GPT-3 to 2,048 in 2020, and today’s models can process up to 16k tokens. With just a few more iterations, you could potentially write an entire book with a single prompt and a little assistance from ChatGPT or one of its counterparts.

As I’ve already admitted, my initial prediction was off the mark.

Just yesterday, a fellow explorer, Nik, sent me a link to a video by David Shapiro, who introduced a newly published paper with the intriguing title, “LONGNET: Scaling Transformers to 1,000,000,000 Tokens“. This paper, courtesy of Microsoft, details their journey from 16k tokens to an astounding 1 billion tokens.

This leap is so monumental that it transcends the notion of merely “pushing the boundaries of what’s possible”. This isn’t just exponential growth; it’s akin to hitting a wall (a nod to Pink Floyd, but that’s a story for another day).

So, let’s delve into this groundbreaking development from Microsoft Research: LONGNET, a Transformer variant that can handle sequence lengths of over 1 billion tokens without compromising performance on shorter sequences.

Why going from 512 to 16k tokens is already phenomenal

Before we dive into the details of LONGNET, let’s take a moment to appreciate the journey from 512 to 16k tokens, which in itself is a feat akin to travelling at light speed. What does this evolution entail? What are the mathematical principles behind it, and what challenges does such a leap pose for the technology that makes it possible?

The jump in sequence length from 512 to 16,000 tokens isn’t a straightforward linear progression. It brings about significant computational challenges. The attention mechanism in these models, which enables them to focus on specific parts of the input when generating an output, functions in a way where every token attends to every other token. This implies that the computational complexity grows at a quadratic rate as the number of tokens increases.

To express this in mathematical terms, if ‘n’ is the number of tokens, the computational complexity of the attention mechanism in a standard Transformer is O(n^2), where O represents the Big O notation used to describe the performance or complexity of an algorithm. 

This means that if you double the number of tokens, the computational cost quadruples. If you increase the number of tokens tenfold, the computational cost surges a hundredfold. 

If you increase the number of tokens from 512 to 16,000 (a multiplication by approximately 32), the “computational cost” skyrockets a thousandfold (since 32² equals 1024).
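To make that arithmetic tangible, here is a tiny Python sketch (my own illustration, not taken from the paper) that counts the query-key pairs a dense attention layer has to score at different sequence lengths:

```python
# Dense self-attention lets every token attend to every other token, so the
# number of query-key pairs it has to score grows quadratically with length.

def dense_attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (512, 1024, 2048, 16_000):
    pairs = dense_attention_pairs(n)
    ratio = pairs / dense_attention_pairs(512)
    print(f"{n} tokens -> {pairs:,} pairs ({ratio:,.0f}x the cost at 512 tokens)")

# 512 tokens -> 262,144 pairs (1x the cost at 512 tokens)
# 1024 tokens -> 1,048,576 pairs (4x the cost at 512 tokens)
# 2048 tokens -> 4,194,304 pairs (16x the cost at 512 tokens)
# 16000 tokens -> 256,000,000 pairs (977x the cost at 512 tokens)
```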

image from 2307.02486.pdf (arxiv.org)

So, what’s the secret behind this seemingly miraculous feat?

The Power of Dilated Attention

The secret sauce behind LONGNET’s remarkable capabilities is a technique known as dilated attention. This innovative method allows the model to focus on a subset of input tokens, exponentially expanding its attentive field as the distance from the token increases. The result is a model that maintains a global perspective of the input while significantly reducing the computational cost.
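To make the idea more concrete, here is a rough NumPy sketch of dilated attention as I understand it from the paper, heavily simplified: single head, no causal masking, a naive average to mix the passes, and segment lengths and dilation rates picked arbitrarily for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention_pass(q, k, v, segment_len, dilation):
    """One (segment length, dilation rate) pass of dilated attention, single head.

    The sequence is cut into segments of `segment_len` tokens; inside each
    segment only every `dilation`-th token participates, and ordinary dense
    attention runs over that thinned-out subset. The per-segment cost drops
    from segment_len**2 to roughly (segment_len / dilation)**2.
    """
    seq_len, d = q.shape
    out = np.zeros_like(v)
    for start in range(0, seq_len, segment_len):
        idx = np.arange(start, min(start + segment_len, seq_len))[::dilation]
        scores = q[idx] @ k[idx].T / np.sqrt(d)   # dense attention on the sparse subset
        out[idx] = softmax(scores) @ v[idx]       # write results back to their positions
    return out

# Several passes with growing segment length and dilation: short, dense passes
# keep local detail; long, heavily dilated passes give every token a coarse
# view of far-away context. Their outputs are simply averaged here.
rng = np.random.default_rng(0)
q = rng.standard_normal((64, 16))
k = rng.standard_normal((64, 16))
v = rng.standard_normal((64, 16))
passes = [(8, 1), (16, 2), (64, 4)]
mixed = sum(dilated_attention_pass(q, k, v, w, r) for w, r in passes) / len(passes)
print(mixed.shape)  # (64, 16)
```

The real model is more refined than this: it weights the different passes rather than averaging them and spreads the sparsified positions across attention heads. The sketch only captures the core intuition that the cost of each segment shrinks with the square of the dilation rate.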

To shed more light on this, I turned to Bing, Bard, and ChatGPT for a more detailed explanation of this sparse dilated attention and how it differs from the traditional way of handling information and tokens.

In conventional language models, a process known as “dense” attention is employed, where every piece of information, or token, in a sequence is considered in relation to every other token. While this approach is thorough, it significantly increases computational complexity as the sequence length grows.

This is where the new approach, known as sparse attention, comes into play. Sparse attention allows each token to focus on only a subset of the other tokens, significantly reducing the computational burden. This means that instead of each token having to consider every other token in the sequence, it only needs to consider a smaller, specific set of other tokens.

This approach enables the model to handle much larger sequences, up to 1 billion tokens, while still maintaining a manageable computational load. The implementation of sparse attention can vary, with different methods used to determine which tokens each token attends to. Some methods use fixed patterns, while others might adapt the attention pattern based on the data.
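A back-of-the-envelope comparison makes the payoff clear. The per-token budget of 512 below is an arbitrary number of my own choosing, not a figure from the paper:

```python
# Rough cost model: dense attention scores n*n pairs, while a sparse scheme in
# which every token attends to at most `budget` other tokens scores n*budget.

BUDGET = 512  # hypothetical per-token attention budget, chosen for illustration

def dense_cost(n: int) -> int:
    return n * n

def sparse_cost(n: int, budget: int = BUDGET) -> int:
    return n * min(n, budget)

for n in (16_000, 1_000_000, 1_000_000_000):
    print(f"n={n:,}: dense={dense_cost(n):.2e}, sparse={sparse_cost(n):.2e}")

# n=16,000: dense=2.56e+08, sparse=8.19e+06
# n=1,000,000: dense=1.00e+12, sparse=5.12e+08
# n=1,000,000,000: dense=1.00e+18, sparse=5.12e+11
```

The point is not the exact numbers but the shape of the curve: the sparse column grows linearly with sequence length, which is what makes a billion-token context thinkable at all.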

To fully grasp this concept, it helps to compare it with how humans process vast amounts of information. Our brains do not remember every single detail of our experiences. Instead, they extract the essential parts and forget the rest. This is a form of sparse representation where the brain focuses on the essential elements and discards the rest. An interesting book suggested by David Shapiro, “The Forgetting Machine” by Quian Quiroga, delves into this concept.

And when will these 1 billion tokens be usable?

Naturally, I couldn’t help but ask our AI companions: when will this groundbreaking development transition from theory to reality? To get an answer, I shared the link to the Microsoft paper with them. This time, it was Bard who provided the most insightful response, and I’d like to share his perspective with you (copy of Bard’s text):

Yes, the text of the paper you linked suggests that the 1B tokens model is already practically usable. The authors of the paper report that the model achieved state-of-the-art results on various natural language processing tasks, including machine translation, text summarization, and question answering. They also note that the model was able to scale to very long sequences, which is a significant challenge for many other language models.

However, the authors also acknowledge that the 1B tokens model is not without its limitations. For example, the model can be computationally expensive to train and run. Additionally, the model can sometimes generate grammatically incorrect or semantically incoherent outputs.

Overall, the text of the paper suggests that the 1B tokens model is a promising new development in natural language processing. The model has the potential to be used in a variety of practical applications, but it is important to be aware of its limitations.

Here are some specific examples from the paper that suggest that the 1B tokens model is already practically usable:

  • The model was able to achieve a BLEU score of 38.5 on the WMT 2014 English-to-German translation task, which is a state-of-the-art result.
  • The model was able to achieve a ROUGE-L score of 47.1 on the CNN/Daily Mail summarization task, which is also a state-of-the-art result.
  • The model was able to answer 89.8% of the questions correctly on the SQuAD 1.1 question answering benchmark, which is a significant improvement over previous models.

The authors of the paper also note that the 1B tokens model was able to scale to very long sequences, such as the full text of Shakespeare’s Hamlet. This is a significant challenge for many other language models, which often struggle to process sequences longer than a few hundred words.

Overall, the text of the paper suggests that the 1B tokens model is a promising new development in natural language processing. The model has the potential to be used in a variety of practical applications, but it is important to be aware of its limitations.

Conclusion

In retrospect, my prediction of exponential evolution fell short of the actual technical advancements unfolding before us. It took approximately 60 years from the first mention of the term ‘Artificial Intelligence’ in 1956 at Dartmouth to reach the 16k token mark. However, the leap from 16k to 1 billion tokens is so monumental that it feels like hitting a wall, rendering any mathematical calculations of this growth almost meaningless.

Meanwhile, humans are still embroiled in our own quarrels, barely having descended from our proverbial trees on an evolutionary scale. Yet, experts predict that AGI (Artificial General Intelligence) is just 5 to 10 years away.

It’s fascinating to juxtapose these two timelines as a foundation for discussing the impact of the coming years on our evolution.

A 1-billion-token AI would be able to read the entire legacy of Shakespeare, comprising 884,647 words (according to Historyinnumbers), in approximately 1/1000 of a second. In contrast, reading the paper describing how it achieves this feat took me significantly longer.

image from 2307.02486.pdf (arxiv.org)

In my next blog, I’ll circle back to how these advancements can be harnessed in our daily lives and how they can be tremendously beneficial for our personal journeys and the path we’re paving for our world.

Yours in AI

Luc

THANK YOU

A heartfelt thank you to Nik Silverans, one of my comrades in this captivating and awe-inspiring exploration of the AI universe, for alerting me to this seismic shift in the landscape of disruption. May the force (and the necessary Gigabytes) be with you!

ORIGINAL PAPER

LONGNET: Scaling Transformers to 1,000,000,000 Tokens
Jiayu Ding∗ Shuming Ma∗ Li Dong Xingxing Zhang Shaohan Huang Wenhui Wang Furu Wei† Microsoft Research https://aka.ms/GeneralAI

TL;DR

In the realm of AI, we’re witnessing a revolution that’s far beyond exponential. The LONGNET model, a brainchild of Microsoft, has scaled Transformers to handle a staggering 1 billion tokens.

This leap is powered by a technique called dilated attention, which allows the model to focus on a subset of input tokens, thereby significantly reducing computational cost while maintaining a global view of the input.

This approach, known as sparse attention, is a game-changer, enabling the model to handle sequences up to 1 billion tokens.

The implications are profound. For instance, this AI can read the entire legacy of Shakespeare in 1/1000 of a second.

But it’s not just about speed; it’s about the potential applications in our daily lives and the transformative journey we’re embarking on.

As we stand on the brink of this new era, it’s time to ponder the impact of these advancements on our evolution. Buckle up, my friends, we’re in for a thrilling ride in the world of AI!