Long-Sequence Transformers: A Review of the SOTA
Published by Achraff Adjileye
A lot of work has gone into processing long documents and lifting the limitation of BERT-like models, which can only process sequences of up to 512 tokens. This has led to the release of several variants of these models designed for long documents. The main idea behind most of them is to make the attention mechanism of the Transformer (see "Attention Is All You Need") scale linearly with the input sequence length, rather than quadratically, in both time and memory.
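To make the quadratic-versus-linear contrast concrete, here is a minimal sketch that counts the query-key scores an attention layer would compute for a few sequence lengths. It is only back-of-the-envelope arithmetic, not an implementation of any particular model: the function names are hypothetical, and the fixed local window of 512 tokens is an assumption chosen for illustration (a sliding window is one common way to get linear scaling, but not the only one).

```python
def full_attention_scores(n: int) -> int:
    """Scores computed by full self-attention: every token attends to every token."""
    return n * n


def windowed_attention_scores(n: int, window: int) -> int:
    """Upper bound on scores when each token attends to at most `window` tokens.

    `window` is an illustrative assumption, not a value taken from a specific model.
    """
    return n * min(window, n)


if __name__ == "__main__":
    for n in (512, 4096, 16384):
        full = full_attention_scores(n)
        local = windowed_attention_scores(n, window=512)
        print(f"n={n:>6}  full={full:>12,}  windowed={local:>12,}  ratio={full // local}x")
```

With these assumed numbers, going from 512 to 4,096 tokens multiplies the full-attention cost by 64, while the windowed cost grows only 8 times, in proportion to the sequence length.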
In what follows, we’ll present, in chronological order, various models from the literature that make Transformer-based models scale better with input sequence length. Each model is covered in minimal detail, with useful references, to facilitate a literature review for anyone interested in the subject. The models are grouped by family of approach.