Long-Sequence Transformers: a review of the SOTA
Published by Achraff Adjileye
A lot of work has been done on processing long documents, lifting the limitation of BERT-like models, which can only handle sequences of up to 512 tokens. This has led to the release of several variants of these models designed for long documents. The main idea behind most of them is to make the attention mechanism of the Transformer (see "Attention Is All You Need") scale linearly with the input sequence length instead of quadratically, in both time and memory complexity.
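To make the quadratic-versus-linear point concrete, here is a minimal sketch, not taken from any of the models discussed here: the function names (full_attention, local_attention) and the fixed window size are illustrative assumptions. Full self-attention materializes an n x n score matrix, so cost grows quadratically with sequence length n; a simple windowed attention, where each token only attends to a fixed-size neighborhood, keeps the cost linear in n.

```python
# Illustrative sketch only (not from the article): contrasts full self-attention,
# which builds an n x n score matrix, with a fixed-window local attention whose
# cost grows linearly with n. No learned projections, shapes only.
import torch

def full_attention(q, k, v):
    # q, k, v: (n, d); scores: (n, n) => O(n^2) time and memory
    scores = q @ k.T / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def local_attention(q, k, v, window=4):
    # Each query attends only to a (2*window + 1)-token neighborhood
    # => O(n * window) time and memory, i.e. linear in n for a fixed window.
    n, d = q.shape
    out = torch.empty_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        s = q[i] @ k[lo:hi].T / d ** 0.5
        out[i] = torch.softmax(s, dim=-1) @ v[lo:hi]
    return out

n, d = 1024, 64
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
print(full_attention(q, k, v).shape)   # torch.Size([1024, 64]), via a 1024 x 1024 score matrix
print(local_attention(q, k, v).shape)  # torch.Size([1024, 64]), never builds the full matrix
```

Windowed attention is only one of the linearization strategies used by the models reviewed below; others rely on global tokens, random patterns, or kernel approximations, but the goal is the same: avoid the full n x n attention matrix.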