Fastformer: Additive Attention is All You Need
Abstract
The Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity with respect to the input sequence length. Although there are many methods for Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, an efficient Transformer model based on additive attention. Instead of modeling the pairwise interactions between tokens, Fastformer first uses an additive attention mechanism to model global contexts, and then further transforms each token representation based on its interaction with the global context representations. In this way, Fastformer achieves effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models while achieving comparable or even better long text modeling performance.
1 Introduction
The Transformer (Vaswani et al., 2017) and its variants have achieved great success in many fields. For example, the Transformer is the backbone architecture of many state-of-the-art pre-trained language models in NLP, such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2019). The Transformer also shows great promise in vision-related tasks (Dosovitskiy et al., 2020). The core of a Transformer model is the self-attention mechanism, which allows the Transformer to model the contexts within an input sequence (Parikh et al., 2016). However, since self-attention computes the dot product between the input representations at each pair of positions, its complexity is quadratic in the input sequence length (Vaswani et al., 2017). Thus, it is difficult for standard Transformer models to efficiently handle long input sequences (Tay et al., 2020).
Many methods have been proposed to accelerate the Transformer model (Beltagy et al., 2020; Zaheer et al., 2020; Wang et al., 2020; Kitaev et al., 2020; Tay et al., 2021). For example, BigBird (Zaheer et al., 2020) computes sparse attention instead of dense attention, combining local attention, global attention at certain positions and random attention between a certain number of tokens. However, sparse attention usually cannot fully model the global context (Wu et al., 2021). Linformer (Wang et al., 2020) exploits the low-rank structure of the self-attention matrix by computing an approximation of it: the attention key and value are projected into low-dimensional matrices whose sizes are independent of the sequence length. However, the approximation is in fact context-agnostic, which may weaken the context modeling ability of the Transformer. In addition, these methods are not efficient enough when the input sequence is very long.
In this paper we propose Fastformer, an efficient Transformer variant based on additive attention that achieves effective context modeling with linear complexity. In Fastformer, we first use an additive attention mechanism to summarize the input attention query matrix into a global query vector. Next, we model the interaction between the attention keys and the global query vector via element-wise product to learn a global context-aware key matrix, which we further summarize into a global key vector via additive attention. We then use element-wise product to aggregate the global key vector with each attention value, and process the results with a linear transformation to compute the global context-aware attention values. Finally, we add the original attention query and the global context-aware attention values together to form the final output. We conduct extensive experiments on five benchmark datasets covering various tasks, including sentiment classification, topic prediction, news recommendation and text summarization. The results demonstrate that Fastformer is much more efficient than many Transformer models while achieving quite competitive results in long text modeling.
The contributions of this paper are summarized as follows:

We propose an additive-attention-based Transformer named Fastformer. To the best of our knowledge, Fastformer is the most efficient Transformer architecture.

We propose to model the interaction between global contexts and token representations via element-wise product, which helps fully model contextual information in an efficient way.

Extensive experiments on five datasets show that Fastformer is much more efficient than many Transformer models and can achieve competitive performance.
2 Related Work
2.1 Transformer and Self-Attention
The Transformer model is built upon multi-head self-attention, which can effectively model the contexts within a sequence by capturing the interactions between the inputs at each pair of positions (Vaswani et al., 2017). An $h$-head self-attention mechanism can be formulated as follows:

$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\mathbf{h}_1, \ldots, \mathbf{h}_h)\,\mathbf{W}^O, \qquad (1)$$

where $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times hd}$ are the input query, key and value matrices, $N$ is the sequence length, $d$ is the hidden dimension in each attention head, and $\mathbf{W}^O \in \mathbb{R}^{hd \times hd}$ is a linear transformation parameter matrix. The representation learned by each attention head is formulated as follows:

$$\mathbf{h}_i = \text{softmax}\!\left(\frac{(\mathbf{Q}\mathbf{W}_i^Q)(\mathbf{K}\mathbf{W}_i^K)^\top}{\sqrt{d}}\right)\mathbf{V}\mathbf{W}_i^V, \qquad (2)$$

where $\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V \in \mathbb{R}^{hd \times d}$ are learnable parameters. From this formula, we can see that the computational complexity is quadratic in the sequence length $N$, which has become a bottleneck for Transformers when handling long sequences.
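The quadratic cost of Eq. (2) is easy to see in a direct implementation. The following single-head sketch (a minimal NumPy illustration, not the paper's code; all names are our own) materializes the full $N \times N$ score matrix:

```python
import numpy as np

def self_attention(Q, K, V):
    """Single-head scaled dot-product attention (illustrative sketch).

    Q, K, V: (N, d) arrays. The N x N score matrix below is the source
    of the quadratic time and memory complexity in the sequence length N.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (N, N)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (N, d)

N, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, N, d))
out = self_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

Doubling the sequence length quadruples the size of `scores`, which is exactly the bottleneck discussed above.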
2.2 Efficient Transformer
In recent years, many approaches have been proposed to improve the efficiency of the Transformer architecture (Tay et al., 2020). Some methods use sparse attention mechanisms to reduce the complexity of self-attention (Child et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Zhang et al., 2021). For example, Longformer (Beltagy et al., 2020) uses sliding-window attention to attend to local contexts and global attention on a few pre-selected input positions to capture global contexts. BigBird (Zaheer et al., 2020) combines local attention, global attention at certain positions and random attention on several randomly selected token pairs. However, these methods may need a relatively large number of attended tokens to avoid performance degradation on longer sequences, which usually limits the achievable speedup.
Another direction uses hashing techniques to accelerate self-attention computation. For example, Reformer (Kitaev et al., 2020) uses a multi-round hashing scheme to put similar representations into the same buckets when computing self-attention, which can theoretically reduce the complexity to $O(N \log N)$. However, the constant factor in Reformer's computational cost is quite large, making it inefficient for sequences of more common, moderate lengths.
Several other methods aim to reduce the computational cost by computing approximated self-attention (Choromanski et al., 2020; Wang et al., 2020; Tay et al., 2021). For instance, Linformer (Wang et al., 2020) projects the attention key and value into low-rank matrices to approximate the self-attention mechanism. Linear Transformer (Katharopoulos et al., 2020) uses a kernel-based formulation of self-attention and the associative property of matrix multiplication to approximate dot-product attention. However, these methods approximate self-attention in a context-agnostic manner, which may not be optimal for text modeling. In addition, they still incur heavy computational cost when the sequence length is very long. Different from the aforementioned methods, Fastformer uses additive attention to model global contexts and element-wise product to model the interaction between each input representation and the global contexts, which greatly reduces the computational cost while effectively capturing contextual information.
3 Fastformer
In this section, we introduce our Fastformer approach, which is based on additive attention. The architecture of Fastformer is shown in Fig. 1. Fastformer first uses an additive attention mechanism to summarize the query sequence into a global query vector. Next, it models the interaction between the global query vector and the attention keys via element-wise product and summarizes the keys into a global key vector via additive attention. It then models the interactions between the global key vector and the attention values via element-wise product and uses a linear transformation to learn global context-aware attention values, which are finally added to the attention query to form the output. In this way, the computational complexity is reduced to linear, while the contextual information in the input sequence is effectively captured. We introduce the details of Fastformer in the following sections.
3.1 Architecture
The Fastformer model first transforms the input embedding matrix into the query, key and value sequences. The input matrix is denoted as $\mathbf{E} \in \mathbb{R}^{N \times d}$, where $N$ is the sequence length and $d$ is the hidden dimension, and its row vectors are denoted as $[\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_N]$. Following the standard Transformer, in each attention head (different attention heads use the same formulation but different parameters) we use three independent linear transformation layers to transform the input into the attention query, key and value matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{N \times d}$, which are written as $\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_N]$, $\mathbf{K} = [\mathbf{k}_1, \ldots, \mathbf{k}_N]$ and $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_N]$, respectively.
Modeling the contextual information of the input sequence based on the interactions among attention query, key and value is a critical problem for Transformer-like architectures. In the vanilla Transformer, a dot-product attention mechanism is used to fully model the interactions between queries and keys. Unfortunately, its quadratic complexity makes it inefficient for long sequence modeling. A potential way to reduce the computational complexity is to summarize the attention matrices (e.g., the query) before modeling their interactions. Additive attention is a form of attention mechanism that can efficiently summarize important information within a sequence with linear complexity. Thus, we first use additive attention to summarize the query matrix into a global query vector $\mathbf{q} \in \mathbb{R}^d$, which condenses the global contextual information in the attention query. More specifically, the attention weight $\alpha_i$ of the $i$-th query vector is computed as follows:

$$\alpha_i = \frac{\exp(\mathbf{w}_q^\top \mathbf{q}_i / \sqrt{d})}{\sum_{j=1}^{N} \exp(\mathbf{w}_q^\top \mathbf{q}_j / \sqrt{d})}, \qquad (3)$$

where $\mathbf{w}_q \in \mathbb{R}^d$ is a learnable parameter vector. The global attention query vector is computed as follows:

$$\mathbf{q} = \sum_{i=1}^{N} \alpha_i \mathbf{q}_i. \qquad (4)$$
Then, a core problem in Fastformer is how to model the interaction between the summarized global query vector and the key matrix. There are several intuitive options, such as adding or concatenating the global query to each vector in the key matrix. However, these cannot differentiate the influence of the global query on different keys, which is not beneficial for context understanding. Element-wise product is an effective operation for modeling the non-linear relations between two vectors (Wang et al., 2017). Thus, we use the element-wise product between the global query vector and each key vector to model their interactions, and combine the results into a global context-aware key matrix. We denote the $i$-th vector in this matrix as $\mathbf{p}_i$, formulated as $\mathbf{p}_i = \mathbf{q} \odot \mathbf{k}_i$ (the symbol $\odot$ denotes element-wise product). In a similar way, for efficiency reasons we use an additive attention mechanism to summarize the global context-aware key matrix. The additive attention weight of its $i$-th vector is computed as follows:

$$\beta_i = \frac{\exp(\mathbf{w}_k^\top \mathbf{p}_i / \sqrt{d})}{\sum_{j=1}^{N} \exp(\mathbf{w}_k^\top \mathbf{p}_j / \sqrt{d})}, \qquad (5)$$

where $\mathbf{w}_k \in \mathbb{R}^d$ is the attention parameter vector. The global key vector $\mathbf{k} \in \mathbb{R}^d$ is further computed as follows:

$$\mathbf{k} = \sum_{i=1}^{N} \beta_i \mathbf{p}_i. \qquad (6)$$
Finally, we model the interaction between the attention value matrix and the global key vector for better context modeling. Similar to the query-key interaction modeling, we perform element-wise product between the global key and each value vector to compute a key-value interaction vector $\mathbf{u}_i$, formulated as $\mathbf{u}_i = \mathbf{k} \odot \mathbf{v}_i$. Motivated by the vanilla Transformer, we apply a linear transformation layer to each key-value interaction vector to learn its hidden representation. The output matrix of this layer is denoted as $\mathbf{R} = [\mathbf{r}_1, \ldots, \mathbf{r}_N] \in \mathbb{R}^{N \times d}$, which is further added to the query matrix to form the final output of Fastformer. Note that the output matrices from different attention heads are concatenated along the hidden dimension axis.
We can see that in Fastformer, each key and value vector interacts with the global query or key vector to learn a contextualized representation, and by stacking multiple Fastformer layers we can fully model contextual information. Motivated by the weight sharing techniques used in (Wang et al., 2020), we share the value and query transformation parameters to reduce the memory cost. In addition, we share parameters across different Fastformer layers to further reduce the parameter size and mitigate the risk of overfitting.
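The attention head described above can be sketched in a few lines of NumPy, following Eqs. (3)–(6). This is an illustrative single-head sketch under our own naming conventions (`w_q`, `w_k`, `W_o` are hypothetical parameter names), not the authors' implementation:

```python
import numpy as np

def additive_softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fastformer_head(Q, K, V, w_q, w_k, W_o):
    """One Fastformer attention head (sketch).

    Q, K, V: (N, d) query/key/value matrices for this head.
    w_q, w_k: (d,) additive-attention parameter vectors.
    W_o: (d, d) output transformation matrix.
    """
    d = Q.shape[-1]
    # Eqs. (3)-(4): summarize the queries into a global query vector.
    alpha = additive_softmax(Q @ w_q / np.sqrt(d))    # (N,)
    q = alpha @ Q                                     # (d,)
    # Element-wise product with each key -> context-aware key matrix,
    # then Eqs. (5)-(6): summarize it into a global key vector.
    P = K * q                                         # (N, d), p_i = q * k_i
    beta = additive_softmax(P @ w_k / np.sqrt(d))     # (N,)
    k = beta @ P                                      # (d,)
    # Key-value interaction, linear transform, residual with the query.
    U = V * k                                         # (N, d), u_i = k * v_i
    return U @ W_o + Q                                # (N, d)

# Every step touches each position once, so the head runs in O(N * d).
N, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, N, d))
out = fastformer_head(Q, K, V, rng.normal(size=d), rng.normal(size=d),
                      rng.normal(size=(d, d)))
print(out.shape)  # (8, 16)
```

Unlike dot-product attention, no $N \times N$ matrix is ever formed: all intermediates are $(N, d)$ or $(d,)$, which is where the linear complexity comes from.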
3.2 Complexity Analysis
In this section, we analyze the computational complexity of Fastformer. For the additive attention networks that learn the global query and key vectors, the time and memory cost are both $O(N \cdot d)$, and they add $2hd$ parameters in total ($h$ is the number of attention heads, with one $d$-dimensional parameter vector each for the query and key summarization per head). The time and memory cost of the element-wise product operations is also $O(N \cdot d)$, so the total complexity is $O(N \cdot d)$ (note that the complexity of linear transformations is not taken into account by many prior works, and we likewise do not consider their effect on computational cost), which is much more efficient than the standard Transformer, whose complexity is $O(N^2 \cdot d)$. If the weight sharing technique is used, Fastformer needs only three transformation matrices per layer (the shared query-value transformation, the key transformation and the output transformation) plus the $2hd$ additive attention parameters. Compared with the Transformer, which uses at least four transformation matrices per layer (the query, key, value and output transformations; parameters in the bias terms, feed-forward network and layer normalization are not counted), Fastformer also uses fewer parameters. These analyses demonstrate the theoretical efficiency of Fastformer.
4 Experiments
Table 1: Statistics of the classification datasets.

Dataset  #Train  #Val  #Test  Avg. len.  #Class
Amazon  40.0k  5.0k  5.0k  133.4  5
IMDB  108.5k  13.6k  13.6k  385.7  10
MIND  128.8k  16.1k  16.1k  505.4  18
Table 2: Statistics of the MIND dataset for news recommendation.

#News  161,013  #Users  1,000,000
#Train impressions  2,232,748  #Val impressions  376,471
#Test impressions  2,370,727  Avg. click history len.  37.1
Table 3: Statistics of the text summarization datasets.

Dataset  #Train  #Val  #Test  Avg. doc/summary len.
CNN/DailyMail  287.1k  13.4k  11.5k  781/56
PubMed  108.5k  13.6k  13.6k  3016/203
Table 4: Performance of different methods on the classification datasets (mean±std over 5 runs).

Methods  Amazon Accuracy  Amazon Macro-F  IMDB Accuracy  IMDB Macro-F  MIND Accuracy  MIND Macro-F
Transformer  65.32±0.35  42.31±0.33  52.04±0.50  42.69±0.47  80.90±0.20  60.02±0.21
Longformer  65.45±0.39  42.48±0.44  52.21±0.36  43.36±0.38  81.36±0.21  62.59±0.23
BigBird  66.14±0.42  42.96±0.40  53.23±0.46  44.03±0.44  81.93±0.24  63.58±0.26
Linformer  66.20±0.49  43.13±0.48  53.17±0.59  44.34±0.57  82.16±0.28  63.77±0.30
Linear Transformer  66.12±0.42  43.04±0.44  53.09±0.47  44.30±0.49  82.25±0.23  63.81±0.22
Poolingformer  66.05±0.44  43.00±0.45  53.78±0.51  44.52±0.50  82.46±0.24  64.10±0.26
Fastformer  66.13±0.29  43.23±0.30  54.10±0.42  44.65±0.44  82.34±0.19  63.89±0.20
4.1 Datasets and Experimental Settings
We conduct extensive experiments on five benchmark datasets for different tasks. The first one is Amazon (He and McAuley, 2016; we use the Electronics domain, https://jmcauley.ucsd.edu/data/amazon/), a widely used dataset for review rating prediction. The second one is IMDB (Diao et al., 2014; https://github.com/nihalb/JMARS), a benchmark dataset for movie review rating prediction. The third one is MIND (Wu et al., 2020; https://msnews.github.io/), a large-scale English dataset for news recommendation and intelligence. We perform two tasks on this dataset: news topic classification based on news bodies, and personalized news recommendation based on the relevance between candidate news and user interests inferred from historically clicked news. The fourth one is the CNN/DailyMail dataset (Hermann et al., 2015; denoted as CNN/DM), a widely used benchmark for text summarization. The fifth one is PubMed (Cohan et al., 2018), another benchmark text summarization dataset with much longer documents. Detailed statistics of these datasets are shown in Tables 1, 2 and 3.
In our experiments, we use GloVe (Pennington et al., 2014) embeddings to initialize the token embedding matrix. To obtain sequence-level embeddings in the classification and news recommendation tasks, we apply an additive attention network to convert the matrix output by Fastformer into a single embedding. In the news recommendation task, following (Wu et al., 2019) we apply Fastformer hierarchically, first learning news embeddings from news titles and then learning user embeddings from the embeddings of historically clicked news. We use Adam (Kingma and Ba, 2014) for model optimization. More detailed experimental settings are included in the Appendix. We run our experiments on an Nvidia Tesla V100 GPU with 32GB memory. We repeat each experiment 5 times and report the average performance as well as the standard deviation. On the classification tasks, we use accuracy and macro-F scores as performance metrics. On the news recommendation task, following (Wu et al., 2020) we use AUC, MRR, nDCG@5 and nDCG@10 as metrics. On the text summarization tasks, we use the ROUGE-1, ROUGE-2 and ROUGE-L metrics (denoted as R-1, R-2 and R-L) to evaluate the generated summaries.
4.2 Effectiveness Comparison
First, we compare the performance of Fastformer with many baseline methods, including:

Transformer (Vaswani et al., 2017): the vanilla Transformer;

Longformer (Beltagy et al., 2020): a Transformer variant with sparse attention, which combines sliding-window attention and global attention to model local and global contexts;

BigBird (Zaheer et al., 2020): an extension of Longformer that incorporates a sparse random attention mechanism;

Linformer (Wang et al., 2020): a Transformer variant with linear complexity, which uses low-dimensional key and value matrices to compute approximated self-attention;

Linear Transformer (Katharopoulos et al., 2020): another linear-complexity Transformer, which uses kernel functions to approximate the self-attention mechanism;

Poolingformer (Zhang et al., 2021): a hierarchical architecture that first uses sliding-window self-attention to capture short-range contexts and then uses pooling self-attention to capture long-range contexts.
The performance of these methods on the three classification datasets is compared in Table 4. From the results, we find that the efficient Transformer variants usually outperform the standard Transformer. This is because the quadratic computational cost of the vanilla Transformer limits the maximum sequence length that can be handled, and many useful contexts are lost when the input text sequence is truncated. Fastformer achieves competitive or better performance than the other efficient Transformer variants in both long and short text modeling. This is because Fastformer can effectively model the global contexts and their relationship to different tokens, which helps capture contextual information accurately.
Table 5: Performance of different methods on the news recommendation task (MIND).

Methods  AUC  MRR  nDCG@5  nDCG@10
NRMS  68.18  33.29  36.31  42.20
FIM  68.31  33.42  36.47  42.35
PLM-NR  70.64  35.39  38.71  44.38
Transformer  68.22  33.32  36.35  42.23
Longformer  67.98  33.04  36.18  42.06
BigBird  68.14  33.28  36.30  42.18
Linformer  68.02  33.19  36.22  42.10
Linear Transformer  67.76  32.94  36.16  41.97
Poolingformer  68.54  33.60  36.69  42.60
Fastformer  69.11  34.25  37.26  43.38
Fastformer+PLM-NR  71.04  35.91  39.16  45.03
Fastformer+PLM-NR*  72.68  37.45  41.51  46.84
We also compare the performance of different methods on the news recommendation task, adding three recent news recommendation methods to the comparison: (1) NRMS (Wu et al., 2019), which uses multi-head self-attention networks to learn news and user representations; (2) FIM (Wang et al., 2020), a fine-grained interest matching method for personalized news recommendation; and (3) PLM-NR (Wu et al., 2021), which empowers news recommendation with pre-trained language models. For PLM-NR, we use the best-performing UniLM-empowered (Bao et al., 2020) model. In addition, we explore replacing the user encoder in PLM-NR with Fastformer. The results are shown in Table 5. Among the different Transformer architectures, Fastformer achieves the best performance, and it also outperforms its base NRMS model. In addition, Fastformer further improves the performance of PLM-NR, and the ensemble model achieves the best results on the MIND leaderboard (https://msnews.github.io/). These results show that Fastformer is effective not only in text modeling but also in understanding user interest.
Table 6: Performance of different methods on the text summarization tasks.

Method  CNN/DM R-1  CNN/DM R-2  CNN/DM R-L  PubMed R-1  PubMed R-2  PubMed R-L
Transformer  38.52  16.04  35.87  34.26  11.88  31.64
Longformer  37.89  15.46  35.19  36.92  14.34  33.75
BigBird  38.31  15.78  35.60  37.73  14.99  34.51
Linformer  37.96  15.58  35.34  37.22  14.48  34.02
Linear Transformer  37.24  14.87  34.64  36.43  13.80  33.21
Poolingformer  38.58  16.16  36.17  37.82  15.15  34.63
Fastformer  38.54  16.22  36.21  38.09  15.44  34.81
We further conduct experiments on the text summarization tasks to verify the effectiveness of Fastformer in natural language generation. The results are shown in Table 6. We find that on the CNN/DM dataset, most efficient Transformer variants (except Poolingformer and Fastformer) are inferior to the vanilla Transformer. This is because sparse-attention-based methods such as Longformer and BigBird cannot fully model the document contexts, and approximated self-attention methods such as Linformer and Linear Transformer cannot effectively preserve context information in the approximation. Since the documents in the CNN/DM dataset are relatively short, these methods may be less effective than the vanilla Transformer when the same sequence length is used. Fastformer achieves the best performance on most metrics, which shows its advantage in natural language generation.
Table 7: Theoretical computational complexity of different methods ($N$: sequence length; $d$: hidden dimension; $K$: average number of attended tokens; $\epsilon$: approximation error bound; $w$: window size).

Method  Complexity
Transformer  $O(N^2 \cdot d)$
Longformer  $O(N \cdot K \cdot d)$
BigBird  $O(N \cdot K \cdot d)$
Linformer  $O(N \cdot d / \epsilon^2)$
Linear Transformer  $O(N \cdot d^2)$
Poolingformer  $O(N \cdot w \cdot d)$
Fastformer  $O(N \cdot d)$
4.3 Efficiency Comparison
In this section, we evaluate the efficiency of different methods. We first compare their theoretical computational complexity in Table 7 (we exclude the complexity of the linear transformations applied after the input and before the output). The complexity of the vanilla Transformer is $O(N^2 \cdot d)$, while all other compared methods have linear complexity with respect to the input sequence length. However, the complexity of Longformer and BigBird depends on the average number of tokens attended to by each token, Linformer's complexity depends on the square of the approximation error, Linear Transformer's complexity contains a quadratic term in the hidden dimension, and Poolingformer's complexity depends on its window size. In contrast, the complexity of Fastformer depends only on the sequence length and the hidden dimension, which is the lowest among the compared methods. This shows that Fastformer is efficient in theory.
Then, we conduct experiments to measure the actual training and inference cost of different methods (using the model configurations from the news topic classification task). Motivated by prior works (Katharopoulos et al., 2020; Wang et al., 2020), we vary the sequence length from 128 to 65535 and scale the batch size inversely with the sequence length. We generate pseudo samples with random tokens and freeze the token embeddings to better isolate the computational cost of each method. The results are shown in Fig. 2 (the maximum sequence length of the Transformer is limited by GPU memory). We find that the Transformer is inefficient when the sequence length is relatively long (e.g., 512). In addition, although Poolingformer has linear complexity in theory, it is inefficient in practice, because it uses a large window size (e.g., 256) to compute pooling weights in a convolution-like way, which leads to a very large constant term in its computational cost. Fastformer is much more efficient than the other linear-complexity Transformer variants in terms of both training and inference time. These results verify the efficiency of Fastformer.
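The measurement methodology above can be reproduced in miniature. The following is a hypothetical micro-benchmark (our own sketch, not the paper's setup) that times a single-head dot-product attention against a single-head Fastformer-style layer at growing sequence lengths:

```python
import time
import numpy as np

def dot_attention(Q, K, V):
    """Vanilla single-head attention: O(N^2 * d) time and memory."""
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def fast_attention(Q, K, V, w_q, w_k):
    """Fastformer-style additive attention: O(N * d) time and memory."""
    d = Q.shape[-1]
    a = np.exp(Q @ w_q / np.sqrt(d)); a /= a.sum()   # global query weights
    P = K * (a @ Q)                                  # context-aware keys
    b = np.exp(P @ w_k / np.sqrt(d)); b /= b.sum()   # global key weights
    return V * (b @ P) + Q                           # residual with query

def bench(fn, *args, reps=3):
    """Average wall-clock seconds over a few repetitions."""
    start = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - start) / reps

d = 64
rng = np.random.default_rng(0)
w_q, w_k = rng.normal(size=(2, d))
for N in (128, 512, 2048):
    Q, K, V = rng.normal(size=(3, N, d))
    print(f"N={N:5d}  dot={bench(dot_attention, Q, K, V):.4f}s"
          f"  fast={bench(fast_attention, Q, K, V, w_q, w_k):.4f}s")
```

On this toy scale, the dot-product timings should roughly quadruple when $N$ doubles while the additive-attention timings grow roughly linearly; the full experiments in Fig. 2 make the same comparison with complete models.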
4.4 Influence of Interaction Function
Next, we study the influence of using different functions to model the interactions among query, key and value in Fastformer. We compare Fastformer with variants that use addition or concatenation to combine the global query/key vector with the vectors in the key/value matrix. The results are shown in Fig. 3. We find that concatenation is not a good option for Fastformer, because simply concatenating two vectors does not model the interactions between them. Addition is also suboptimal, because it can only model linear interactions between two vectors, which may be insufficient for learning accurate context representations. Unlike concatenation and addition, element-wise product can model the non-linear interactions between two vectors, which helps capture the complex contexts in long sequences.
4.5 Influence of Parameter Sharing
Then, we study the influence of different parameter sharing techniques on the performance of Fastformer, including sharing the query and value transformation matrices, sharing parameters across different attention heads, and sharing parameters across different layers (we do not observe significant differences in time cost across the parameter sharing methods, so we only compare performance here). The results are shown in Fig. 4. We find that query-value parameter sharing achieves similar or slightly better performance than Fastformer without any parameter sharing; it is therefore favorable to reduce the parameter size by sharing the query and value transformation matrices. In contrast, head-wise parameter sharing leads to notable performance drops, because different attention heads are expected to capture different context patterns, and sharing their parameters hinders context modeling. Moreover, layer-wise sharing further improves the performance, because sharing parameters among layers mitigates the risk of overfitting (Lan et al., 2020). Thus, in Fastformer we adopt both the query-value sharing and layer-wise sharing strategies to improve performance while reducing the parameter size.
5 Conclusion and Future Work
In this paper, we propose Fastformer, a Transformer variant based on additive attention that can handle long sequences efficiently with linear complexity. In Fastformer, we first use additive attention to summarize the query matrix into a global query vector. Next, we combine it with each key vector via element-wise product to learn a global context-aware key matrix, which we further summarize into a global key vector via additive attention. We then model the interactions between the global context-aware keys and the values to learn global context-aware attention values, which are further combined with the query to form the final output. Extensive experiments on five benchmark datasets show that Fastformer is much more efficient than many existing Transformer models while achieving competitive or even better performance in long text modeling.
In future work, we plan to pre-train Fastformer-based language models to better support NLP tasks that require long document modeling. We will also explore applying Fastformer to other scenarios, such as e-commerce recommendation and ads CTR prediction, to improve user modeling based on long user behavior sequences.
References
Appendix A Appendix
A.1 Experimental Environment
Our experiments are conducted on a cloud Linux server running Ubuntu 16.04. The code is written in Python 3.6 using Keras 2.2.4 with the TensorFlow 1.12 backend. The GPU is an Nvidia Tesla V100 with 32GB of GPU memory. Each experiment is run in a single thread.
A.2 Preprocessing
In our experiments, we use the NLTK toolkit to preprocess the texts. We use the word_tokenize function to convert the input texts into token sequences. The embeddings of out-of-vocabulary words are initialized with random vectors that have the same mean and covariance as the other word embeddings.
A.3 Hyperparameter Settings
The detailed hyperparameter settings on each dataset used in this paper are listed in Table 8.
Table 8: Detailed hyperparameter settings.

Hyperparameter  Amazon  IMDB  MIND (topic)  MIND (rec.)  CNN/DailyMail  PubMed
# Encoder Layers  2  2  2  1  4  2
# Decoder Layers  –  –  –  –  4  4
# Global Tokens  8  8  16  8  64  64
# Random Tokens  8  8  16  8  64  64
Window size (Longformer, BigBird)  8  8  16  8  64  64
Window size (Poolingformer)  64  64  256  64  256  256
Block length (BigBird)  4  4  8  4  32  32
Projection dimension (Linformer)  16  16  16  16  64  64
# Attention Heads  16  16  16  16  16  16
Beam Size  –  –  –  –  5  5
Hidden dimension  256  256  256  256  256  256
Loss  Cross-entropy  Cross-entropy  Cross-entropy  Cross-entropy  Cross-entropy  Cross-entropy
Batch Size  64  64  64  64  32  32
Optimizer  Adam  Adam  Adam  Adam  Adam  Adam
Learning Rate  1e-3  1e-3  1e-3  –  1e-4  1e-4
Max Epochs  3  3  3  3  12  15
Dropout  0.2  0.2  0.2  0.2  0.2  0.2